AD-A204  147 



RADC-TR-88-155 
Final  Technical  Report 
July  1988 


DISTRIBUTED  SYSTEM  RECOVERY 
MECHANISMS 


Honeywell  Inc. 


Anand R. Tripathi, Helmut K. Berg, Jonathan Silverman, William T. Wood, Elaine N. Frankowski,
Pong-Sheng Wang, Shiva Azadegan, Shiv Seth and Rita Wu


APPROVED  FOR  PUBLIC  RELEASE;  DISTRIBUTION  UNLIMITED. 


ROME  AIR  DEVELOPMENT  CENTER 
Air  Force  Systems  Command 
Griffiss Air Force Base, NY 13441-5700




This  report  has  been  reviewed  by  the  RADC  Public  Affairs  Division  (PA) 
and  is  releasable  to  the  National  Technical  Information  Service  (NTIS).  At 
NTIS it will be releasable to the general public, including foreign nations.

RADC-TR-88-155  has  been  reviewed  and  is  approved  for  publication. 


APPROVED:

Thomas F. Lawrence
Project Engineer


APPROVED:

RAYMOND P. URTZ, JR.
Technical Director
Directorate of Command & Control


FOR THE COMMANDER:

Directorate of Plans & Programs


If your address has changed or if you wish to be removed from the RADC
mailing list, or if the addressee is no longer employed by your organization,
please notify RADC (COTD) Griffiss AFB NY 13441-5700. This will assist us
in maintaining a current mailing list.


Do not return copies of this report unless contractual obligations or notices
on a specific document require that it be returned.


REPORT DOCUMENTATION PAGE

Form Approved, OMB No. 0704-0188

1a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
1b. RESTRICTIVE MARKINGS: N/A
2a. SECURITY CLASSIFICATION AUTHORITY: N/A
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE: N/A
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release;
distribution unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): N/A
5. MONITORING ORGANIZATION REPORT NUMBER(S): RADC-TR-88-155
6a. NAME OF PERFORMING ORGANIZATION: Honeywell Inc.
6b. OFFICE SYMBOL (If applicable):
6c. ADDRESS (City, State, and ZIP Code):
7a. NAME OF MONITORING ORGANIZATION: Rome Air Development Center (COTD)
7b. ADDRESS (City, State, and ZIP Code): Griffiss AFB NY 13441-5700
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: Rome Air Development Center
8b. OFFICE SYMBOL (If applicable): COTD
8c. ADDRESS (City, State, and ZIP Code): Griffiss AFB NY 13441-5700
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: F30602-82-C-0154
10. SOURCE OF FUNDING NUMBERS: Project No. 2530; Task No. 01; Work Unit
Accession No. 17
11. TITLE (Include Security Classification): DISTRIBUTED SYSTEM RECOVERY
MECHANISMS
12. PERSONAL AUTHOR(S): Anand R. Tripathi, Helmut K. Berg, Jonathan Silverman,
William T. Wood, Elaine N. Frankowski, Pong-Sheng Wang, Shiva Azadegan,
Shiv Seth, Rita Wu
13a. TYPE OF REPORT: Final
14. DATE OF REPORT (Year, Month, Day): July 1988
15. PAGE COUNT: 178
16. SUPPLEMENTARY NOTATION: Subcontractors:
James C. Browne, James Dutton, Vincent
University of Texas at Austin - Author
17. COSATI CODES: FIELD | GROUP | SUB-GROUP


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

This report describes an effort to develop a system designers guidebook for designing
reliable distributed command and control systems. The guidebook contains a synthesis of
reliable system design principles and methods to evaluate distributed system designs for
performance, reliability, and functional correctness. The approach to developing the system
designers guidebook in this effort is example driven. We develop a detailed design of a
reliable distributed operating system and evaluate its performance.


20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: [X] UNCLASSIFIED/UNLIMITED
[ ] SAME AS RPT. [ ] DTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION: UNCLASSIFIED
22a. NAME OF RESPONSIBLE INDIVIDUAL: Thomas F. Lawrence

DD Form 1473, JUN 86. Previous editions are obsolete.
SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED


Block 18 (Continued)

Fault-Tolerant Systems
Validation
Verification
Replication
Commit Protocol
Design Methods
Object-Oriented Systems
Formal Specification

Block 16 (Continued)

Richard M. Cohen, Lawrence Smith, Lawrence Akers, William Bevier, Miren Carranza,
Ann Siebert


CONTENTS


INTRODUCTION

DISTRIBUTED COMMAND AND CONTROL SYSTEMS
  2.1 Command and Control Systems
    2.1.1 Command and Control System Function
    2.1.2 Operational Environment
  2.2 Architecture Of Distributed Systems
  2.3 Reliable System Requirements
  2.4 Design Issues And Tradeoffs
  2.5 Summary

INTEGRITY MECHANISMS
  3.1 Consistency Management In Distributed Systems
    3.1.1 Timestamp Based Protocols
    3.1.2 Locking Protocols
    3.1.3 Optimistic Concurrency Control
    3.1.4 Basic Timestamp Ordering Versus Locking
    3.1.5 Non-Serial Consistency
  3.2 Reliability Techniques In Distributed Systems
    3.2.1 Error Detection Techniques
    3.2.2 Error Recovery Techniques
      3.2.2.1 Checkpointing and Rollback
      3.2.2.2 Careful Replacement
      3.2.2.3 Logs/Audit Trail
      3.2.2.4 Commit Protocols and Atomic Actions
      3.2.2.5 Replication Management in Distributed Systems
      3.2.2.6 Network Partitioning and Continued Operations
  3.3 Summary

ABSTRACT DISTRIBUTED SYSTEM ARCHITECTURE
  4.1 System Structuring Concepts
  4.2 A Design Model For Reliable Distributed Systems
  4.3 A Model Of An Object Oriented Reliable Distributed System
    4.3.1 Structure of Object-Oriented Distributed Systems
    4.3.2 Functions of the Type Managers
    4.3.3 Structure of Type Managers
    4.3.4 Distributed Types
  4.4 Summary

ZEUS: AN EXAMPLE SYSTEM DESIGN
  5.1 Structure of the Zeus System
    5.1.1 Structure of the Zeus Kernel
    5.1.2 System-Defined Type Managers
    5.1.3 Process Management

  5.2 Formal Definitions of the Designs
  5.3 Summary

ANALYSIS AND VALIDATION TECHNIQUES
  6.1 Introduction
  6.2 Performance Evaluation Of Recovery Mechanisms
    6.2.1 Performance Measures
    6.2.2 Models and Hierarchical Structuring
    6.2.3 Parts of a Performance Model: System, Environment, Workload
      6.2.3.1 Analytic Methods
      6.2.3.2 Simulation Methods
      6.2.3.3 Hybrid Methods
    6.2.4 Performance Measures for Recovery Mechanisms
    6.2.5 Example Metrics for Some Generic Integrity Mechanisms
      6.2.5.1 Transaction Mechanisms
      6.2.5.2 Object Replication
    6.2.6 Zeus Performance Modeling
    6.2.7 Summary
  6.3 Reliability Analysis Techniques
    6.3.1 Specifications of Reliability Measures
    6.3.2 Network-Based Reliability Model
    6.3.3 Conclusions
  6.4 Validation And Verification Techniques
    6.4.1 Proofs of Recovery Mechanisms using Gypsy
      6.4.1.1 Introduction
      6.4.1.2 Gypsy Support for the Specification of Recovery Mechanisms
      6.4.1.3 Specifications and Proofs of Recoverable Objects
      6.4.1.4 Recovery Scenario for Atomic Actions
      6.4.1.5 Summary
    6.4.2 Recovery Mechanism Proofs using Interval Logic
      6.4.2.1 Proofs of Global Assertions
    6.4.3 Functional Simulation of Fault Tolerance
      6.4.3.1 Issues in Simulating Fault Tolerance
      6.4.3.2 Summary

PERFORMANCE EVALUATION OF THE ZEUS SYSTEM
  7.1 Model Overview
    7.1.1 Model Environment
    7.1.2 Model System Structure
    7.1.3 Model Workload
  7.2 Goals of the Example Evaluation
  7.3 Summary of the Evaluation Data

FUTURE DIRECTIONS
  8.1 System Structuring

  8.2 Analysis and Validation
  8.3 Formal Methods for Design Definition
  8.4 System Designers Workbench
  8.5 Recommendations

REFERENCES

Appendix I

CSDL: CONCURRENT SYSTEM DEFINITION LANGUAGE
  1.1 INTRODUCTION
  1.2 MODELS
    1.2.1 Computational Models
      1.2.1.1 Sequential Computations
      1.2.1.2 Concurrent Computations
      1.2.1.3 Histories
    1.2.2 System Model
  1.3 METHODOLOGY
    1.3.1 Constructive Correctness
    1.3.2 Object Orientation
    1.3.3 Complexity Management
    1.3.4 Linguistic Support for the Design Guidelines
  1.4 CONSTRUCTS AND NOTATION
    1.4.1 System Definition Structure
    1.4.2 Procedures
      1.4.2.1 Algorithmic Language
      1.4.2.2 Atemporal Specification
    1.4.3 Data
      1.4.3.1 Passive Data
        1.4.3.1.1 Built-in Passive Types
        1.4.3.1.2 Abstract Data Types
          1.4.3.1.2.1 Type Definitions
          1.4.3.1.2.2 Type Refiners
      1.4.3.2 Active Data
        1.4.3.2.1 Complementarity
        1.4.3.2.2 Connection
        1.4.3.2.3 Inlets and Outlets
    1.4.4 Machines
      1.4.4.1 Machine Definitions
      1.4.4.2 Machine Realizations
      1.4.4.3 Dynamism
        1.4.4.3.1 Machine Creation
        1.4.4.3.2 The Need for Pools
        1.4.4.3.3 The Role of Public Objects
        1.4.4.3.4 Communication
          1.4.4.3.4.1 Linking
          1.4.4.3.4.2 Information Flow When Forging Communication Links
      1.4.4.4 Temporal Specifications
    1.4.5 Documentation Format
  1.5 EXAMPLE
  1.6 DISCUSSION

Appendix II


CHAPTER  1 


INTRODUCTION 


Reliability and timeliness are the two most important attributes of command
and control systems. Command and control system functions involve control of
defense system resources and communication of intelligence and command
information among the various constituents of the system. This requires the
timely collection, processing, and communication of large amounts of
information to ensure effective coordination among the geographically
dispersed components of the command and control system.

Distributed system technology provides an important and attractive
approach to supporting the operations of future command and control
systems. This technology potentially supports:

1.  Dispersion  of  the  data  as  well  as  the  processing  functions  to  various 
locations  of  the  command  and  control  system. 

2.  Redundancy  of  data  and  functions  to  improve  the  reliability  of  the  system 
due  to  the  multiplicity  of  processing  resources. 

3.  Efficient  communication  of  information  by  networking  of  processing 
resources. 

The goal of this contract is to synthesize a set of techniques for building
reliable distributed systems for command and control applications and to
evaluate their designs for fault tolerance, reliability, and performance. This
work involved a study of system recovery mechanisms for distributed systems,
the development of concepts for integrating them into a distributed operating
system, and finally the development of a set of methods for evaluating the
performance and reliability of such designs. The final outcome of this contract
is a two-volume system designers guidebook titled A DESIGNERS GUIDE TO RELIABLE
DISTRIBUTED SYSTEMS. The first volume of this guidebook presents the design
and analysis methods, and the second volume contains the detailed designs of
an example distributed operating system called Zeus and its performance
evaluation data.

The design of a distributed system involves many complex decisions. The
purpose of a designers guidebook is to help a designer systematically
address the various design issues and make the most appropriate decisions
so that the final design meets the desired requirements. It is important to
stress the distinction between a guidebook and a handbook. A guidebook
provides a comprehensive set of procedures which can aid a designer in
achieving a goal. A handbook provides a comprehensive set of results (e.g.,
tables) which provide a basis from which a designer may make design decisions
for a specific application. It is appropriate to write a handbook if one has
the details of a set of applications of interest and the associated system
environments. A guidebook is applicable to a larger set of problems and
designers because of its orientation to procedures rather than results.

A guidebook describes the steps which take a designer from a set of
requirement statements to a detailed system design which would exhibit the
desired operational characteristics on a specified implementation base. Each
design step refines the design and further defines the system's operational
attributes. One set of attributes comprises those associated with the fault
tolerance of a system: availability, reliability, and survivability.
Examples of the design decisions that must be made are the degree of
availability required for a given application and the performance required of
a system environment to achieve it. It is a well-established principle that
designs should be subjected to early evaluation before any implementation
starts. In fact, the design steps and the evaluation steps should
proceed in a closely coupled fashion. This book presents a set of design
guidelines for constructing fault-tolerant distributed systems and a set of
procedures for evaluating the desired operational characteristics of such
designs.

The main contribution of this research is a unified presentation of
system recovery mechanisms, a framework for their integration, and a set of
evaluation techniques. It provides a starting point for the development of a
design methodology for fault-tolerant distributed systems.

The system designers guidebook is organized into two volumes. The first
volume describes reliability mechanisms, a framework for expressing designs,
and techniques for evaluating mechanisms. Two classes of problems are not
addressed in the reliability mechanism discussion: security and
Byzantine agreement. These problems were deemed outside the scope of this
contract. A framework based on object-oriented design is defined and used for
expressing designs because it motivates the discussion of reliability
mechanisms and aids in their integration into a unified design model. An
example distributed operating system called Zeus is derived from the framework
and used as a basis for presenting and demonstrating analysis techniques.
This example design illustrates the integration of recovery mechanisms into
distributed system designs. Zeus should be regarded as a design framework
rather than a point solution. Although the mechanisms, techniques, and
results are described within the context of an object-oriented design, they
are equally applicable to process-oriented designs. Volume II contains the
complete details of the example system and the results of its analysis. Some
familiarity with Ada [DoD83](1) and the Concurrent System Definition Language
[FRANSBa] is required to understand the detailed designs. The definition of
the Concurrent System Definition Language is included as an appendix to this
report.


(1)  Ada  is  a  registered  trademark  of  the  U.S.  Government,  Ada  Joint  Program 
Office 

The approach taken in developing the system designers guidebook is
depicted in Figure 1-1. The system recovery mechanisms are integrated into
the Zeus design. Concurrent System Definition Language (CSDL) is used for the
formal definition of the designs. PAWS (Performance Analyst's Workbench
System), Gypsy, NetRAT, and Path Pascal are used to evaluate the Zeus designs.
The methodology used in the design, the analysis process, and the evaluation
results are documented in the system designers guidebook.

This report presents a concise yet complete overview of the technical
approach taken during the course of this program and the highlights of the
important technical accomplishments. Each chapter of this report describes an
important milestone in the course of this contract, our approach to achieving
the milestone, the uniqueness of the approach, and finally the major
accomplishments. One of the highlights of our approach to developing the
system designers guidebook is the definition and design of an example
distributed operating system and the application of a set of design evaluation
techniques using this example system as a testbed.

A system design can only be done in the context of its application
environment. For this reason we studied the operational environment and the
functional requirements of distributed command and control systems. The
results of this study are described in the second chapter of this report.
The main emphasis of the study was on recovery mechanisms for distributed
command and control systems. One of the tasks was to survey these recovery
mechanisms and provide a comprehensive description of them in the system
designers guidebook. An important outcome of this task was an object-oriented
design model for building reliable distributed systems. This design model
integrates the surveyed recovery mechanisms into one framework. A brief
overview of the survey is presented in Chapter 3. The object-oriented design
model and the example system, called Zeus, which is based on this model, are
described in Chapters 4 and 5. Chapter 5 is devoted to the detailed designs
of the example system and the formal definition of such designs. In this
context we discuss the work performed on the Concurrent System Definition
Language (CSDL).

The description of the various analysis tools and techniques forms an
important part of the system designers guidebook. In this contract work we
focused on using PAWS (Performance Analyst's Workbench System(1)) for
performance evaluations, NetRAT for reliability evaluations, Gypsy for formal
correctness proofs, and Path Pascal for functional simulation to validate
fault tolerance. Chapter 6 describes the highlights of our work in the
development and application of these techniques to reliable distributed
systems. Chapter 7 presents the goals of the performance modeling of the Zeus
system using PAWS, the approach to the evaluations, and a summary of the
simulation results. Chapter 8 is devoted to possible future directions for
this work. These include a detailed study of recovery mechanisms in a
process-oriented design, the design of fault-tolerant real-time systems,
experimental evaluations of recovery mechanisms in the context of the
object-oriented framework developed under this contract, and the development
of an object-oriented general purpose distributed operating system such as
Zeus.


(1)  PAWS  is  a  registered  trademark  of  Information  Research  Associates,  Austin, 
Texas. 


CHAPTER 2


DISTRIBUTED COMMAND AND CONTROL SYSTEMS


A system design can be meaningfully and successfully carried out only in
the context of its intended application environment. A thorough understanding
of the applications is essential in order to make the requirement statements
for the system functionality, reliability, and performance. The system
functionality and the associated reliability statements identify the kinds of
failures the system must withstand and the consistency that must be maintained
for the system objects in the presence of failures and concurrent operations.
The performance statements identify the desired response time and throughput
of the various system functions under the specified workload. The results
discussed here describe the operational characteristics and the functional
requirements of distributed command and control systems, and also identify the
forms of requirement statements for system performance and reliability. The
results of this effort are applicable to both strategic and tactical command
and control systems.

2.1  Command  and  Control  Systems 

Any command and control system must support four basic functions:
communication, navigation, data collection, and decision support. These
systems can be divided into two broad categories, strategic and tactical.
Systems in these two categories differ in their geographic scope, their
functional complexity, and the mobility of the system nodes. A strategic
command and control system encompasses a relatively large region of operations
(a radius of roughly 500 to 1,000 miles). It maintains large long-lived
databases and contains several smaller, and possibly tactical, command and
control systems as its constituents. Tactical command and control systems are
generally smaller in geographic scope; the distance between nodes is typically
10-200 miles. The nodes of tactical systems are relatively mobile: they
can be moved and installed in a few days. The communications facilities that
connect the nodes of a tactical system are usually much less reliable than
those used to connect the nodes of strategic systems.

2.1.1 Command and Control System Function

The general goals for the data processing elements in a command and
control system are to:

o Make information available to the users who need it.

o Improve the response time of time-sensitive operations.

o Support the database needs of the users.

o Make available global databases which are needed for planning,
coordination, threat assessment, targeting, intelligence production, and
status monitoring.

o Provide reliable dissemination of messages carrying requirements,
commands, warnings, and status information.

o Provide extensive degraded mode operating capability.

o Provide enhanced survivability and continuous operation under the loss
of C2 system components.

o Support multi-user multi-level security of information.

Efficient database sharing is the most crucial requirement of command and
control systems. A command and control system must support global logical
objects for the following kinds of information: weather, personnel,
logistics, enemy situation, friendly situation, surveillance and
identification, warnings and alerts, mission status, tactical air support
requests, etc. The major role of the data processing functions performed by a
command and control system is to maintain this database and to provide
timely and accurate reports from it.

2.1.2 Operational Environment

Because of the evolutionary nature of future distributed C2 systems, it
is desirable to adopt an approach which permits relatively easy changes for
system expansions, capacity upgrades, functionality upgrades, hardware
substitution, and addition of new elements. The approach of modular system
design should also help in rapidly configuring new systems.

Instead of designing systems to meet certain specific requirements, it is
desirable to provide an architecture which can adapt easily to the long-term
changes in requirements due to the state of the technology and the
world situation, as well as the short-term changes in requirements due to the
tactical environment. One can use the physical and communication environment
features of distributed command and control systems to characterize their
operational environment.

Physical Environment:

A single command and control system consists of several geographically
dispersed command centers. The distance between the units can range from a
few miles to a few hundred miles for tactical systems, and up to a few
thousand miles for strategic systems. The geographical dispersion serves to
increase the field of view or to provide higher survivability to the command
centers by locating them in rear areas.

Communication between the command centers in a C2 system can be
implemented with microwave or radio frequency channels. In some cases, where
the distances are not too large, coaxial cables or fiber-optic cables may be
used. The command centers are high value targets, and placing them in the
rear areas for reasons of survivability will decrease performance because
of communication delays.

Most of the communication among and within the constituent command centers
of a command and control system consists of command messages, database
updates, and query messages. Most of the other data processing requirements of
a command center will normally be supported by the resources co-located within
it.

Most of the important databases critical to command and control
operations are maintained at the command centers. To support continued
operations in the event of the loss of a command center, another center must
be able to reconstruct the database from the replicated components of the
global database.

Communications Environment:

The communications within a command center will, in general, be supported
by a local area network (LAN), and those among the command centers will be
supported by long-haul networks. Thus a long-haul network connects several
LANs in a C2 system. The long-haul network topology will change dynamically
because mobile command centers will be moved in response to the tactical
situation. The length of a LAN communication link can range from a few
meters to hundreds of meters. The long-haul links can be up to about a
thousand miles long. The communication bandwidth ranges from 1 to 10 Mb/sec
for LANs and from 10 to 50 kb/sec for long-haul communication.
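
To give a feel for what these bandwidth ranges imply, the following minimal sketch compares the time needed to transmit one database update over each class of link. It reads the LAN figure as 1 to 10 Mb/sec; the 10,000-byte message size and the function names are assumptions chosen for illustration, and propagation delay and protocol overhead are ignored.

```python
# Illustrative arithmetic only. The link rates are the ends of the ranges
# quoted in the text; the message size is a hypothetical example value.

LINK_RATES_BPS = {
    "LAN, low end (1 Mb/sec)": 1e6,
    "LAN, high end (10 Mb/sec)": 10e6,
    "long-haul, low end (10 kb/sec)": 10e3,
    "long-haul, high end (50 kb/sec)": 50e3,
}


def transfer_seconds(message_bytes: int, bits_per_second: float) -> float:
    """Time to clock a message onto a link (no propagation or protocol cost)."""
    return message_bytes * 8 / bits_per_second


def compare(message_bytes: int) -> dict:
    """Transmission time of one message on each class of link."""
    return {
        name: transfer_seconds(message_bytes, rate)
        for name, rate in LINK_RATES_BPS.items()
    }


if __name__ == "__main__":
    for name, seconds in compare(10_000).items():
        print(f"{name}: {seconds:.3f} s")
```

Under these assumptions the same update that occupies a LAN for a few hundredths of a second occupies a long-haul link for seconds, which suggests why updates to replicated databases across command centers are a central performance concern.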

The long-haul network must be connected to external elements such as the
World Wide Military Command and Control System (WWMCCS), Intelligence Data
Handling System Communications (IDHSC), and the Defense Communications System
(DCS).

Some of the biggest problems which will affect communications system
performance are electronic warfare, self-jamming, and the loss of nodes
(mininets). Network partitioning, node drop-out, node reunion, and network
reconfiguration are some of the problems which the designers must address.

2.2 Architecture Of Distributed Systems

Redundancy of both hardware and software resources is the most important
characteristic of reliable systems that support continued operations despite
component losses. The geographical distribution of critical system databases
and processing resources is key to the design of survivable systems. Thus, in
the event of loss of a particular site in a command and control system, it
should be possible to use database copies and processing resources at other
sites. The need to maintain and update replicated databases imposes the
following requirements on the underlying architecture:

o The system must contain appropriate processing resources (CPU, memory,
secondary storage) at each site.

o There must be communication between the sites as well as with local and
remote users of the databases (a) to keep the replicated copies mutually
consistent and (b) to provide access to remote users on system
reconfiguration.

These two requirements make distributed system architectures the most
natural candidates for supporting highly survivable C2 systems. Functional
redundancy and geographical dispersion enable distributed systems to survive
hostile actions and to provide continuous operation. These advantages of
distributed systems arise partially from the distribution of system state
information. However, effective survivability mechanisms are based upon
consistent system state. Distributed operating systems used in this
application must incorporate mechanisms to maintain the consistency of the
distributed system state information in the presence of concurrent updates and
system component failures. This is essential to guarantee correct functioning
on reconfiguration and restart; therefore, suitable recovery mechanisms and
concurrency control mechanisms are required in the distributed operating
system to maintain consistency of distributed state variables.
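
The need to keep replicated copies mutually consistent across site loss and restart can be illustrated with a small sketch. This is not the Zeus design or any mechanism taken from this report: the class names, the version counter, and the single coordinating object are assumptions made for illustration, and a real distributed system would have no such central coordinator.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One site's copy of the database (hypothetical structure)."""
    site: str
    up: bool = True
    version: int = 0
    data: dict = field(default_factory=dict)


class ReplicatedDatabase:
    """Toy coordinator: writes go to every live copy, and a restarted
    site copies the newest live replica before serving requests."""

    def __init__(self, sites):
        self.replicas = {s: Replica(s) for s in sites}
        self.version = 0  # simplification: one global update counter

    def _live(self):
        return [r for r in self.replicas.values() if r.up]

    def update(self, key, value):
        live = self._live()
        if not live:
            raise RuntimeError("no site available")
        self.version += 1
        for r in live:
            r.data[key] = value
            r.version = self.version

    def read(self, key):
        live = self._live()
        if not live:
            raise RuntimeError("no site available")
        newest = max(live, key=lambda r: r.version)
        return newest.data.get(key)

    def fail(self, site):
        self.replicas[site].up = False

    def rejoin(self, site):
        """Restart a site and bring its stale copy up to date."""
        r = self.replicas[site]
        donors = self._live()  # computed before the site is marked up
        r.up = True
        if donors:
            newest = max(donors, key=lambda d: d.version)
            if newest.version > r.version:
                r.data = dict(newest.data)
                r.version = newest.version
```

In this sketch a site that was down during an update holds a stale copy until `rejoin` refreshes it, after which any surviving site can answer reads correctly. Concurrent updates and network partitions, which the text identifies as the hard cases, are deliberately left out.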

A distributed system consists of multiple computers, interconnected by a
communication network, that cooperate to complete a computation. The
mechanisms that enable the cooperation are implemented by a distributed
operating system. A conceptual picture of a distributed operating system is
shown in Figure 2-1.

A site consists of at least one physical processor, an operating system
kernel, primary and possibly secondary memory, an interface to the
communication network, and possibly interfaces to input/output facilities.
The sites are physically separated and communication occurs by message
exchange rather than by shared memory. Each site has processes and resources
which constitute fragments of system processing activities. Since control of
these processing activities is distributed among the sites, a single site
normally has neither system-wide authority nor a complete view of the global
system state.

A  distributed  operating  system  creates  and  manages  logical  (perhaps 
physically  distributed)  resources  (processes  and  files)  and  physical  resources 
(processors  and  memories).  A  distributed  operating  system  is  based  on  a  set 
of  protocols  which  govern  interaction  between  sites.  The  operating  system 
kernel  at  each  site  manages  its  physical  resources  autonomously  and  may 
cooperate  with  other  kernels  in  the  management  of  its  logical  resources.  The 
state  information  may  be  partitioned  and  distributed  among  the  operating 
system kernels. The individual kernels operate concurrently, and possibly
asynchronously, on the basis of locally available state information. The
system interface consists of a set of functions which the distributed
operating system provides to the application environment. Ideally, these
interfaces should have the following features:

1.  Transparency of resource locations: The user-visible functions for
    accessing the resources in the system should make the location of the
    resources transparent to the clients. The mechanisms for accessing
    remote and local resources should be uniform.

2.  Transparency of recovery mechanisms: The interfaces provided by the
    distributed operating system should make the recovery mechanisms
    transparent to the clients; however, the clients should have enough
    control over the selection of those parameters that are critical to the
    system performance in different application environments.

A  communication  system  transfers  information  among  the  sites  in  a 
distributed  system.  It  is  used  by  distributed  operating  system  kernels, 
system  processes,  and  application  processes  to  convey  updates  and  to  gain 
access  to  global  system  state  and  to  utilize  resources  provided  by  other 
sites.  Communication  systems  typically  appear  in  system  designs  and 
implementations  at  a  higher  level  of  abstraction  because  the  detailed 
realization  is  isolated  from  the  remainder  of  the  system.  In  the  context  of  a 
command  and  control  system,  a  communication  system  is  meant  to  include  an 
internet,  a  collection  of  interconnected  networks. 

2.3  Reliable  System  Requirements 

An  application  may  be  described  as  a  collection  of  objects  on  which 
operations  are  executed  by  users.  Some  operations  may  be  combined  together 
and executed as a single end-user operation. A requirements statement may be
made about the performance and reliability of the operations as described
below.

Performance Requirements: In general the performance requirements are
specified in terms of the response time and throughput. There are several
ways in which these two measures may be specified for a distributed system.
Average throughput and average response time for the execution of a given
operation on an object can be specified in the following terms:

(1)  For  the  overall  system, 

(2)  For  some  particular  sites  in  the  system, 

(3) For each operational mode such as emergency/peace-time modes,

(4)  As  a  function  of  available  resources  (sites)  in  the  system. 

In  addition  to  the  mean  values,  upper  and  lower  bounds  or  variances  may  be 
specified  for  these  measures. 

Reliability Requirements: Traditionally the reliability requirements for a
service or operation are specified in terms of its expected availability and
mean-time-to-failure. Like the performance requirements specifications, the
reliability requirements can also be specified in terms of the four ways
described above. They may also include the types of failures that must be
withstood, and the number of failures of a given type that must be withstood.

Similar  requirements  may  be  made  about  groups  of  operations.  In 
addition,  it  is  assumed  that  statements  are  made  about  what  the  hardware 
configuration  is  and  the  assignment  of  objects  and  operations  to  sites.  From 
such  information  we  are  interested  in  answering  two  questions:  "What  level  of 
reliability  does  a  system  provide?",  and  "What  is  the  extra  cost  of  the 
reliability?". 

Ideally  a  user  should  be  able  to  specify  the  desired  reliability 
requirements  for  objects,  operations,  and  groups  of  operations  without  knowing 
the details of the implementing mechanisms; tools should then automatically
configure  a  system.  More  realistically,  a  system  administrator  who  is 
knowledgeable  about  the  system's  hardware  and  software  will  manually  make  the 
selections  and  adjustments  needed  to  achieve  the  desired  level  of  reliability. 

In  order  to  state  system  requirements  and  to  develop  a  system  that  meets 
them  we  are  still  faced  with  a  problem  of  how  to  specify  the  requirements. 
Traditionally,  component  reliability  is  given  by  statistical  quantities  such 
as  the  mean  time  to  failure  (MTTF),  mean  time  to  repair  (MTTR),  and  the 
probability  of  availability.  This  suggests  that  one  way  for  a  user  to  specify 
the  reliability  of  objects  and  operations  is  to  give  the  desired  values 
(e.g., 0.95 MTTF). This is certainly the most accurate way of defining the
expected  reliability  of  an  object  since  it  takes  into  account  the 
interrelation  among  all  of  the  dependent  objects  and  components  of  a  system 
and  their  individual  characteristics.  There  are,  however,  at  least  three 
problems  with  using  statistical  quantities  for  user  level  specifications. 

First  is  the  problem  of  using  small  quantities  to  specify  values.  Should 
the  MTTF  be  0.94  or  0.95?  Why?  Would  different  users  choose  different  values 
for  similar  objects? 

Secondly,  there  are  typically  many  different  combinations  of  parameters, 
replication strategies, and configurations which will yield the same or
roughly the same MTTF and availability for a given object. A simple numerical
quantity gives no indication as to which of several possible strategies to
choose. Furthermore, without being given additional information, it is
difficult for an administrator and probably impossible for the system itself
to choose the appropriate solution.
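
The second problem can be illustrated numerically. The sketch below assumes
the standard steady-state relation availability = MTTF / (MTTF + MTTR) and
assumes that replicas fail independently; the function names and all figures
are hypothetical and are not taken from this report.

```python
import math


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of one component: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def replicated_availability(copy_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent replicas is up."""
    return 1.0 - (1.0 - copy_availability) ** copies


# Configuration A: a single, highly reliable site (hypothetical hours).
single = availability(mttf=990.0, mttr=10.0)

# Configuration B: two ordinary sites, each holding a copy of the object.
pair = replicated_availability(availability(mttf=900.0, mttr=100.0), copies=2)

# Both configurations reach roughly the same availability figure.
same_figure = math.isclose(single, pair)
```

Under these assumptions the two quite different configurations are
indistinguishable by their availability figure alone, which is why a single
number gives an administrator little guidance in choosing between them.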

The third problem arises due to the application environment which is
being considered in this guidebook. The probabilities of component failures
may change unpredictably under military stress conditions. The MTTF and
availability of a component completely describe its fault characteristics.
There are many circumstances, however, where these metrics are difficult or
impossible to evaluate. As an example of such a circumstance, consider a
command and control system in a potential combat situation. The definition of
a component "fault" in such a system would have to include the destruction or
disruption  of  that  component  in  combat;  and  thus,  this  eventuality  must  be 
taken  into  account  when  calculating  the  MTTF  and  availability  of  the 
component.  Unfortunately,  the  probability  of  an  attack,  or  the  probability 
that  an  attack  will  ensue  in  such  a  way  as  to  affect  the  performance  of  a 
given  component  of  the  system,  depends  on  a  number  of  decidedly 
non-quantifiable  factors  such  as  political  climate,  human  factors,  recent 
history,  and  so  on.  In  such  conditions  it  is  more  reasonable  to  ask  questions 
such  as  "Does  this  service  (function)  remain  available  given  that  a  set  of 
components  are  unavailable?". 

The alternative to using statistical quantities is to provide a set of
pre-defined reliability levels. Associated with each reliability level is a
consistency (integrity) specification. The levels overcome the problems with
strictly numerical specifications by associating a boolean-valued consistency
requirement with each level. An object is said to belong to a certain
reliability level for a given set of faults if the associated consistency
specifications are maintained in the presence of those faults. The object is
viewed as being "completely immune" to the faults in that set for the
associated consistency level. The maximum cardinality of such a set for a
given reliability level of an object determines the robustness of that object.
Faults may include events such as site failures, link failures, disk failures,
and memory failures. Each category can be refined when it is appropriate to do
so. For example, disk failures can be refined to include single page failures
and disk pack failures.
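
One possible reading of this definition can be sketched as follows. The
sketch considers site failures only, takes the consistency requirement to be
"at least `required` copies of the object remain on working sites", and
interprets robustness as the largest k for which every set of k failed sites
is tolerated. All names, and this interpretation itself, are assumptions made
for illustration.

```python
from itertools import combinations


def tolerates(copies, failed, required=1):
    """Boolean requirement: enough copies survive on sites outside `failed`."""
    return len(set(copies) - set(failed)) >= required


def robustness(copies, sites, required=1):
    """Largest k such that every set of k failed sites is tolerated."""
    k = 0
    for size in range(1, len(sites) + 1):
        if all(tolerates(copies, f, required)
               for f in combinations(sites, size)):
            k = size
        else:
            break
    return k


sites = ["A", "B", "C", "D"]       # hypothetical sites
copies = ["A", "B", "C"]           # sites that hold a copy of the object

any_copy = robustness(copies, sites, required=1)   # one surviving copy is enough
majority = robustness(copies, sites, required=2)   # two of three copies needed
```

The question asked is the boolean one posed earlier, "does this object remain
available given that this set of components is unavailable?", rather than a
probability.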

2.4  Design  Issues  And  Tradeoffs 

Effectiveness  of  a  recovery  mechanism  can  be  measured  in  terms  of 
recovery  time  and  performance  overhead.  To  see  why  the  recovery  time  and 
performance  overhead  are  important  in  evaluating  the  recovery  mechanism, 
consider  the  performance  of  a  system  under  normal  and  faulty  conditions. 
Assume  that  throughput  (defined  as  the  number  of  units  of  work  performed  per 
unit  of  time)  is  an  indication  of  the  system  performance.  If  there  were  no 
failures,  there  would  be  no  need  for  recovery  mechanisms.  In  this 
hypothetical  situation,  under  a  constant  load  (e.g.,  fixed  number  of  jobs 
running  in  the  system  at  all  times)  the  throughput  stays  constant  at  a  level 
that  is  referred  to  as  the  ideal  level  of  performance.  Introducing  recovery 
mechanisms into this system to enable it to deal with failures degrades the
performance even when there are no failures. An operating overhead is imposed,
consisting of (1) the processing overhead required to check and maintain
information about system state for recovery and (2) the storage overhead
required to hold redundant information. It is desirable to
choose those recovery mechanisms that have the least performance overhead
under normal operation.

When  an  error  condition  occurs,  certain  recovery  procedures  are 
initiated.  These  procedures  cause  an  even  higher  performance  overhead.  This 
is  called  failure  recovery  operation  overhead.  After  the  fault  is  eventually 
cleared  and  the  system  is  recovered,  the  performance  goes  back  to  the  level 
before  the  failure.  Figure  2-2  depicts  this  simplified  situation.  There  are 
two important parameters that have to be considered when a failure occurs.
First, how much time does it take for the system to recover from the failure?
This period of time is called system recovery time. For the duration of the
system recovery time, the performance of the system is at its lowest level.
Therefore, a good recovery mechanism has to minimize this time period.
Second, how much is the performance of the system degraded for the duration of
the system recovery? The performance overhead factor includes both the normal
operation overhead and the failure recovery overhead.

A  more  realistic  situation  is  depicted  in  Figure  2-3.  The  system 
operates  normally  until  a  fault  occurs  and  some  component  of  the  system 
becomes  inoperative.  The  system  executes  recovery  procedures  and  operates  at 
reduced  capacity.  After  the  recovery  procedures  are  executed,  the  performance 
rises  to  a  level  below  full  system  performance.  Later,  the  fault  is  cleared 
and  the  system  executes  recovery  procedures  to  restore  the  consistency  of 
global  system  state.  During  this  time,  performance  is  again  degraded. 
Finally,  throughput  is  restored  to  the  normal  level. 

This view introduces additional effectiveness measures. Reduced
configuration  overhead  is  the  difference  between  ideal  performance  and 
performance  while  part  of  the  system  is  inoperative.  Reconfiguration  recovery 
overhead  is  the  difference  between  ideal  performance  and  performance  while 
global  system  state  is  being  restored.  Reconfiguration  time  is  the  duration 
of  this  processing. 
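
These measures can be made concrete with a small calculation. The sketch
below models the more realistic scenario as a sequence of phases, each with a
duration and a throughput; every figure is invented for illustration, and the
"lost work" total (the area between ideal and actual throughput) is an
additional summary that is not defined in this report.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    duration: int      # length of the phase, in time units
    throughput: int    # work units completed per time unit


IDEAL = 100  # throughput with no recovery mechanisms at all (hypothetical)


def overhead(phase: Phase, ideal: int = IDEAL) -> int:
    """Difference between ideal performance and performance in this phase."""
    return ideal - phase.throughput


def lost_work(phases, ideal: int = IDEAL) -> int:
    """Total work forgone relative to the ideal level over all phases."""
    return sum(overhead(p, ideal) * p.duration for p in phases)


timeline = [
    Phase("normal operation", 20, 90),
    Phase("failure recovery", 5, 40),
    Phase("reduced configuration", 50, 70),
    Phase("reconfiguration recovery", 10, 50),
    Phase("normal operation", 20, 90),
]
by_name = {p.name: p for p in timeline}

reduced_configuration_overhead = overhead(by_name["reduced configuration"])
reconfiguration_recovery_overhead = overhead(by_name["reconfiguration recovery"])
reconfiguration_time = by_name["reconfiguration recovery"].duration
```

With these figures the long reduced-configuration phase accounts for most of
the forgone work, even though its overhead per unit of time is smaller than
that of either recovery phase.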

Cost is another important factor in deciding which recovery mechanisms
should be included in a distributed system. Cost may be measured in terms of
the additional hardware resources required to implement a recovery mechanism
while maintaining the same level of performance as without the recovery
mechanisms. This includes the cost of additional primary and secondary
storage and processing power. The memory requirement is derived from the size
of the recovery mechanism procedures and the size of any additional data
structures. The secondary storage requirements may be further increased if
multiple copies of objects must be stored. The additional
processing overhead is derived from the performance overhead previously
discussed. Another way to characterize the overhead due to recovery
mechanisms is in terms of increased response time and reduced throughput.

2.5 Summary

The  system  designers  guidebook  presents  the  results  of  our  study  of 
functional  requirements  of  distributed  command  and  control  systems.  In 
designing such systems it is important to understand their operational
environment in order to define the performance and reliability requirements.
Distributed system architectures seem to be the ideal and most natural choice
for implementing future command and control systems. This is because
such architectures support integration of geographically dispersed
processing elements into one coherent monolithic system. This integration is
achieved by a distributed operating system which provides mechanisms for
managing distributed resources in the system. Transparency of resources and
of the recovery mechanisms for the resource management functions are two
important attributes of an ideal distributed operating system. The
introduction of recovery mechanisms imposes certain performance penalties,
such as reduced throughput and increased response time, because of the extra
resources and CPU cycles required for maintaining the additional system state
needed for recovery.


CHAPTER 3


INTEGRITY  MECHANISMS 


The  operating  conditions  that  exist  in  a  distributed  system  define 
requirements  for  the  consistency  and  reliability  management  techniques.  These 
conditions  include  concurrent  operations  and  component  failures.  Concurrent 
operations  may  access  common  data  and  inadvertently  compromise  the  integrity 
of the data. If there are multiple copies of data, the problem of concurrent
access must address the issue of the interdependency of the copies' values.
Whenever components fail, or if users are permitted to abort operations, all
other sites that are executing a part of the operation must be informed and
data restored to a consistent state. If a user's computation is dependent on
an intermediate value of a failed or aborted computation, the dependent
computation may also have to be rolled back. It is possible that a cascade of
rollbacks, a domino effect, may occur.

The  consistency  requirements  in  a  distributed  system  are  characterized  by 
four  criteria.  The  first  criterion,  internal  consistency,  is  the  semantic 
integrity  of  the  data.  The  second  criterion,  mutual  consistency,  is  the 
relation  between  the  copies  of  replicated  distributed  data.  One  example  of  a 
mutual consistency requirement is that all copies of a replicated datum
converge to the same value some time after the updating of the data is stopped.
The  third  criterion,  external  consistency,  is  the  relation  of  the  system 
interactions  with  the  users.  For  example,  if  a  user  invoking  a  transaction  is 
given  a  response  indicating  successful  completion,  then  the  updates  made  by 
the  transaction  must  be  reflected  in  the  database.  The  external  consistency 
requirements  are  dependent  upon  the  definition  of  the  user-system  interface. 
Interactive  consistency  [LAMP82],  the  fourth  criterion,  requires  that  all 
correctly  functioning  nodes  in  the  system  have  an  identical  view  of  the  system 
despite  the  malfunctioning  of  some  nodes.  This  is  also  known  as  the  Byzantine 
Generals  Problem. 

The  key  principles  for  designing  reliable  systems  are  the  atomicity  of 
transactions  and  the  management  of  redundant  components.  A  transaction  is  a 
set  of  primitive  operations  on  data  that  appears  to  be  executed  as  an 
indivisible operation. A transaction's implementation in a distributed system
requires a protocol by which a collection of processes may reliably decide to
make permanent (i.e., "commit") its effects. Transactions provide a common
work unit for the problems of error recovery and synchronization. The
techniques to solve these two problems in a design interact closely with each
other because of the need to maintain recoverable consistent states.

Management of redundancy in the system in the form of replication of
objects or creation of backup objects is important for supporting continued
operations in the event of loss of resources. The major problem in redundancy
management is the maintenance of consistency among replicated objects, and the
maintenance of sufficient up-to-date state information with the backup modules
to support reconfiguration. Several strategies may be used to manage such
state information, for example keeping a majority or a survivable set of the
replicated units in a consistent state.

This chapter introduces the terminology, concepts, and issues involved in
consistency and reliability management techniques. The details of the
algorithms  for  implementing  the  techniques  are  contained  within  the  system 
designers  guidebook. 


3.1  Consistency  Management  In  Distributed  Systems 


The  goal  of  concurrency  control  techniques  is  to  maintain  mutual, 
internal,  and  external  consistency  requirements  of  shared  data  and  to  maximize 
the  throughput  of  access  to  the  data.  The  techniques  used  for  maintaining 
consistency  of  data  under  concurrent  update  operations  consist  of  four  tasks. 
The first is to assign an order to all the transactions. The second is to
identify conflicting transactions and conflicts. The third is to realize the
inter-site synchronization required to achieve this order for the conflicting
transactions. The fourth is to achieve the required intra-site
synchronization. The schedule produced may be serializable or
non-serializable. A serializable schedule means that the final effect of
executing interleaved operations of concurrent transactions on the database is
equivalent to some serial execution order of those transactions. A
non-serializable scheduler seeks to increase the concurrency between
transactions by examining the semantics of operations. A serializable
scheduler uses only the syntax of a transaction. Almost all systems to date
use serializable schedulers. There are three basic techniques to achieve
serial consistency: timestamps, locks, and optimistic methods.
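
A common way to test a given schedule for serializability is the precedence
graph test for conflict serializability. This test is not described in this
report and is shown here only as an illustration: it is a sufficient
condition for the definition above, and it uses only the syntax of the
operations (which transaction reads or writes which item). The schedule
representation and the function name are assumptions.

```python
def is_conflict_serializable(schedule):
    """schedule: list of (transaction, op, item) with op "r" or "w"."""
    edges = {}
    txns = set()
    for i, (ti, op_i, item_i) in enumerate(schedule):
        txns.add(ti)
        for tj, op_j, item_j in schedule[i + 1:]:
            # Two operations conflict if they come from different
            # transactions, touch the same item, and one is a write.
            if ti != tj and item_i == item_j and "w" in (op_i, op_j):
                edges.setdefault(ti, set()).add(tj)

    # The schedule is conflict serializable iff the graph has no cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in txns}

    def has_cycle(t):
        colour[t] = GREY
        for nxt in edges.get(t, ()):
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and has_cycle(nxt):
                return True
        colour[t] = BLACK
        return False

    return not any(colour[t] == WHITE and has_cycle(t) for t in txns)


serial_like = [("T1", "r", "x"), ("T1", "w", "x"),
               ("T2", "r", "x"), ("T2", "w", "x")]
interleaved = [("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]
```

In the second schedule T1 must precede T2 (T1 reads x before T2 writes it)
and T2 must precede T1 (T2 writes x before T1 does), so no equivalent serial
order exists under this test.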


3.1.1  Timestamp  Based  Protocols 


In timestamp-based protocols, every transaction and every data item is
assigned  a  globally  unique  timestamp.  The  timestamp  of  the  datum  is  equal  to 
the  timestamp  of  the  last  transaction  that  accessed  the  datum.  To  access  a 
datum,  a  transaction  sends  its  timestamp  and  the  type  of  operation  (e.g.,  read 
or  write)  to  the  site  where  the  datum  resides.  In  order  to  serialize  requests 
and  resolve  conflicts  a  scheduler  at  the  site  where  the  datum  resides  uses  a 
rule  to  compare  the  timestamp  and  operation  request  of  the  transaction  with 
the  timestamp  of  the  datum. 

A  number  of  timestamp  based  protocols  have  been  proposed.  In  general, 
the  greater  the  amount  of  concurrency  permitted,  the  greater  the  probability 
that  an  operation  may  be  rejected  and  a  transaction  restarted.  Basic 
timestamp-ordering and conservative timestamp-ordering are the endpoints of a
spectrum. Basic timestamp-ordering delays operations very little, but it
tends to reject many operations. It schedules a transaction's operation if
its timestamp is greater than the timestamp of the datum. Conservative
timestamp-ordering never rejects operations, but it tends to delay them often.
It requires that a scheduler have an operation request from every other node
before a request is granted. Since it has a request from all nodes, it can
safely allow the request with the smallest timestamp to proceed.
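
The two endpoints can be sketched as follows, following the simplified rules
stated above with a single timestamp per datum. Fuller formulations of basic
timestamp ordering usually keep separate read and write timestamps for each
datum; that refinement, the class and function names, and the treatment of an
equal timestamp as a repeated access by the same transaction are all
assumptions of this sketch.

```python
class BasicTimestampScheduler:
    """Accepts or rejects each request immediately; never delays."""

    def __init__(self):
        self.datum_ts = {}  # datum -> timestamp of last accessing transaction

    def request(self, txn_ts, datum):
        last = self.datum_ts.get(datum, 0)
        if txn_ts < last:
            # A younger transaction already accessed the datum: reject,
            # and the requesting transaction must be restarted.
            return "reject"
        self.datum_ts[datum] = txn_ts
        return "accept"


def conservative_next(pending):
    """pending: node -> queue of request timestamps, oldest first.

    Returns the timestamp allowed to proceed, or None if the scheduler
    must wait because some node has no request queued yet.
    """
    if any(not queue for queue in pending.values()):
        return None
    return min(queue[0] for queue in pending.values())


scheduler = BasicTimestampScheduler()
outcomes = [scheduler.request(5, "x"),   # first access
            scheduler.request(3, "x"),   # older transaction arrives late
            scheduler.request(7, "x")]   # younger transaction
```

The basic scheduler answers at once but restarts the late transaction; the
conservative function never rejects, at the price of waiting until every node
has contributed a request.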


3.1.2  Locking  Protocols 


In  locking  protocols,  a  transaction  requests  a  lock  on  an  object, 
operates  on  the  object  only  when  it  has  been  granted  a  lock,  and  releases  the 
lock  on  the  object  when  it  no  longer  needs  the  object.  The  exact  time  that  a 
transaction  releases  a  lock  is  dependent  on  how  the  logical  database  is 
organized. If the logical database has the structure of a directed graph, a
transaction may release an object as soon as its operation on the datum is
completed; otherwise it must wait until there are no other locks to be
acquired. The latter case is called two-phase locking and the former
non-two-phase locking.

Two-Phase  Locking  Protocols 

A  two-phase  locking  protocol  specifies  that  in  each  transaction  all  the 
locking  operations  must  precede  any  unlocking  operation,  and  all  transactions 
must  be  well-formed.  A  well-formed  transaction  acquires  locks  on  objects 
before  accessing  them.  It  has  been  shown  in  [ESWA76]  that  if  all  transactions 
follow  the  two-phase  locking  protocol,  then  the  schedules  of  their  executions 
are serializable. It is easy to see that deadlocks are possible under
two-phase locking. To avoid deadlocks we could impose an order on all the
entities and stipulate that all transactions request locks only in that order.
Alternatively,  when  a  transaction  has  been  permitted  to  start  executing,  it 
may  put  intention  locks  on  all  the  entities  it  would  ever  need.  These  locks 
may  be  used  to  rule  out  the  possibility  of  a  deadlock  before  permitting  any 
other  transaction  to  set  its  intention  locks. 
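
The two conditions of the protocol can be checked mechanically for a single
transaction. The sketch below is illustrative only: the representation of a
transaction as a list of (action, item) pairs and the function name are
assumptions, and lock conflicts between transactions are not modeled.

```python
def follows_two_phase_protocol(ops):
    """ops: sequence of (action, item), action in lock / unlock / access."""
    held = set()
    unlocking = False   # becomes True at the first unlock
    for action, item in ops:
        if action == "lock":
            if unlocking:
                return False        # a lock after an unlock: not two-phase
            held.add(item)
        elif action == "unlock":
            unlocking = True
            held.discard(item)
        elif action == "access":
            if item not in held:
                return False        # access without a lock: not well-formed
        else:
            raise ValueError(f"unknown action: {action}")
    return True


two_phase = [("lock", "x"), ("access", "x"), ("lock", "y"),
             ("access", "y"), ("unlock", "x"), ("unlock", "y")]
not_two_phase = [("lock", "x"), ("unlock", "x"), ("lock", "y")]
not_well_formed = [("access", "x")]
```

A transaction that also requests its locks in a fixed global order, for
example by sorting the items it needs before locking them, follows the first
deadlock avoidance rule mentioned above.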

Another  approach  to  ensure  deadlock  freedom  is  to  add  a  deadlock 
prevention scheme to the locking scheme. Rosenkrantz et al. [ROSE78] have
proposed  two  such  deadlock  prevention  schemes.  Timestamps  are  assigned  to 
transactions  and  are  used  as  priorities  in  determining  what  to  do  when  a 
transaction  requests  a  lock  on  an  object  that  is  already  locked.  In  the 
Wait-Die  scheme,  an  older  transaction  waits  on  the  completion  of  a  younger 
transaction  that  holds  a  resource  that  the  older  transaction  requests;  but  a 
younger  transaction  that  requests  a  resource  held  by  an  older  transaction  is 
forced  to  restart.  In  the  Wound-Wait  scheme,  an  older  transaction  waits  on  a 
younger  transaction  only  if  the  younger  transaction  has  started  its 
termination;  otherwise,  the  younger  transaction  is  restarted.  A  younger 
transaction  is  allowed  to  wait  on  an  older  transaction. 
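
The two rules can be written as decision functions. In this sketch a smaller
timestamp means an older transaction; the function names, the returned
strings, and the `holder_terminating` flag are illustrative assumptions.

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die: an older requester waits, a younger requester restarts."""
    if requester_ts < holder_ts:
        return "requester waits"
    return "requester restarts"


def wound_wait(requester_ts, holder_ts, holder_terminating=False):
    """Wound-Wait: an older requester wounds a younger holder unless the
    holder has already started its termination; a younger requester waits."""
    if requester_ts < holder_ts:
        if holder_terminating:
            return "requester waits"
        return "holder restarts"
    return "requester waits"


OLD, YOUNG = 10, 20   # hypothetical transaction timestamps
```

In both schemes a wait is only ever permitted in one direction of the age
ordering (apart from the bounded wait on a terminating holder), which is why
no cycle of waiting transactions can form.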

The  Wait-Die  scheme  has  the  disadvantage  that  a  younger  transaction  may 
restart and die several times before completing successfully. The restarts
will consume some of the system resources. However, this scheme has the
advantage over the Wound-Wait scheme that after a transaction has acquired all
of the resources it needs, it cannot be pre-empted and restarted. In the
Wound-Wait scheme, even when a transaction has locked all the resources it
needs, but has not yet initiated its termination, it is possible for it
to be wounded and forced to restart.

Non-Two-Phase  Locking 

Only  a  few  protocols  have  been  proposed  that  are  not  two-phase  locking. 
One of these, proposed in [SILB81], presumes a tree-structured, hierarchically
organized  database.  Transactions  must  be  well  formed  and  locks  are  acquired 
as  follows.  A  transaction  Ti  may  initially  request  a  lock  at  any  node  (e.g., 
entity).  Subsequent  lock  requests  may  be  made  only  for  direct  descendants  of 
nodes  for  which  Ti  already  has  a  lock.  When  a  lock  is  released,  it  may  not  be 
reacquired.  The  schedules  produced  by  this  protocol  are  serializable  and, 
unlike  the  two-phase  locking  protocols,  are  deadlock  free.  An  intuitive 
understanding  of  this  fact  is  straightforward.  Each  transaction  has  a 
frontier  of  lowest  nodes  in  the  tree  on  which  it  holds  the  locks.  The 
protocol  guarantees  that  these  frontiers  do  not  overlap.  If  the  frontier  of 
Ti  begins  above  the  frontier  of  Tj,  it  will  remain  so,  and  every  item  to  be 
locked  by  both  will  be  locked  by  Tj  first. 
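
The lock acquisition rules for a single transaction can be sketched as
follows. The class name and the parent-map representation of the tree are
assumptions, and conflicts with locks held by other transactions are not
modeled; the sketch only checks whether a request is permitted by the rules
above.

```python
class TreeLockingTransaction:
    def __init__(self, parent):
        self.parent = parent     # node -> parent node (the root maps to None)
        self.held = set()
        self.released = set()
        self.started = False     # True once the first lock is granted

    def lock(self, node):
        """Return True if the protocol permits this lock request."""
        if node in self.held or node in self.released:
            return False         # a released lock may not be reacquired
        if self.started and self.parent.get(node) not in self.held:
            return False         # must be a child of a currently locked node
        self.started = True
        self.held.add(node)
        return True

    def unlock(self, node):
        self.held.remove(node)
        self.released.add(node)


# Hypothetical tree: root has children a and b; a has child c.
tree = {"root": None, "a": "root", "b": "root", "c": "a"}
t = TreeLockingTransaction(tree)
steps = [t.lock("a"),     # first lock may be anywhere
         t.lock("c"),     # child of a, which is held
         t.lock("b")]     # parent (root) is not held
t.unlock("a")
steps.append(t.lock("a"))  # already released once
```

Because a lock can be released as soon as the descendants of interest are
locked, the transaction's frontier moves down the tree without ever holding
all of its locks at once.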

When  locks  are  used  in  a  distributed  system  a  number  of  additional 
considerations arise. Among them are whether global synchronization is
required to lock an object, how an object is globally synchronized, how global
synchronization can be achieved with a minimal number of messages, and how any
one node can be kept from becoming a performance bottleneck and a single point
of failure. These issues are discussed at length in the system designers
guidebook. 


3.1.3  Optimistic  Concurrency  Control 


The optimistic method for concurrency control [KUNG81] assumes that
transaction  conflict  is  rare  and  that  concurrency  can  be  increased  by 
eliminating  locks  and  their  associated  overhead.  Every  transaction  goes 
through  three  phases  —  read,  validation,  and  write.  During  the  read  phase,  a 
transaction  reads  objects,  creates  local  copies  of  the  objects,  and  updates 
the  local  copies.  The  validation  phase  determines  if  the  operations  of  a 
transaction  conflict  with  those  of  another  transaction  and  violate  serial 
consistency  requirements.  If  the  test  fails,  the  transaction  is  aborted. 
Otherwise  a  transaction  enters  a  write  phase  where  its  updates  are  made 
permanent  or  the  results  of  a  query  are  displayed. 

In  order  to  ensure  the  serializability  of  the  transaction,  each 
transaction  is  assigned  a  unique  integer  and  the  transactions  are  serialized 
according to their assigned numbers. The validation procedure ensures that
one of the following three conditions holds: (1) a transaction, Ti, with a
smaller assigned number completes its write phase before a transaction, Tj,
with a larger assigned number starts its read phase; (2) the write set of Ti
does not intersect the read set of Tj, and Ti completes its write phase before
Tj starts its write phase; and (3) the write set of Ti does not intersect the
read set or the write set of Tj, and Ti completes its read phase before Tj
completes its read phase.

Condition (1) states that Ti actually completes before Tj starts.
Condition (2) states that the writes of Ti do not affect the read phase of Tj,
and that Ti finishes writing before Tj starts writing, hence does not
overwrite Tj (also, note that Tj cannot affect the read phase of Ti).
Finally, condition (3) is similar to condition (2) but does not require that
Ti finish writing before Tj starts writing; it simply requires that Ti not
affect the read phase or the write phase of Tj (again note that Tj cannot
affect the read phase of Ti, by the last part of the condition).
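
The three conditions translate directly into a test on a pair of
transactions. The sketch below checks them after the fact against recorded
phase boundaries, which is a simplification: an actual validation step runs
before Tj's write phase begins. The record layout, the integer clock values,
and the function name are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    read_set: set
    write_set: set
    read_start: int
    read_end: int
    write_start: int
    write_end: int


def validates(ti: Txn, tj: Txn) -> bool:
    """True if one of conditions (1)-(3) holds; ti has the smaller number."""
    # (1) Ti completes its write phase before Tj starts its read phase.
    if ti.write_end < tj.read_start:
        return True
    # (2) Ti writes nothing Tj reads, and finishes writing before Tj writes.
    if not (ti.write_set & tj.read_set) and ti.write_end < tj.write_start:
        return True
    # (3) Ti writes nothing Tj reads or writes, and Ti completes its read
    #     phase before Tj completes its read phase.
    if (not (ti.write_set & (tj.read_set | tj.write_set))
            and ti.read_end < tj.read_end):
        return True
    return False


ti = Txn({"x"}, {"x"}, read_start=0, read_end=2, write_start=3, write_end=4)
after = Txn({"x"}, {"y"}, 5, 6, 7, 8)      # starts after Ti is done: (1)
disjoint = Txn({"y"}, {"x"}, 1, 5, 6, 7)   # reads nothing Ti wrote: (2)
clash = Txn({"x"}, {"y"}, 1, 5, 6, 7)      # read x while Ti was writing it
```

If none of the conditions holds for some earlier-numbered transaction, the
later transaction fails validation and is aborted, as described above.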

The transactions are assigned their transaction numbers after they
complete  the  read  phase  to  avoid  the  possibility  of  a  more  recent  transaction 
with  a  short  read  phase  being  blocked  by  an  earlier  transaction  with  a  long 
read  phase.  This  scheme  of  assigning  transaction  numbers  does  not  require  the 
validation  of  condition  (3)  above. 


3.1.4 Basic Timestamp Ordering Versus Locking


Timestamp ordering in centralized systems tends to behave very similarly to
locking but has the disadvantage of inducing larger numbers of restarts. This
is because the timestamp ordering scheme determines the serialization order
a priori. What may appear to be a transaction conflict that induces a restart
based  on  timestamp  ordering  may  not  be  a  conflict  using  locking  and  optimistic 
methods.  For  example,  if  a  transaction  with  a  larger  timestamp  reads  an 
object  and  completes  before  a  transaction  with  a  smaller  timestamp  writes  the 
same  object,  the  transaction  with  the  smaller  timestamp  will  be  aborted. 
Locking  and  optimistic  schemes  would  allow  both  transactions  to  complete 
successfully. 

Locks  are  required  to  implement  critical  sections  in  both  timestamp 
ordering  and  optimistic  schemes.  For  example,  locks  are  required  while 
reading  and  updating  the  timestamps  associated  with  the  objects.  More 
importantly, in the timestamp ordering scheme some form of logical locking is
required  to  prevent  triggered  aborts.  Such  a  situation  arises  when  a  more 
recent  transaction  is  allowed  to  read  objects  that  have  been  updated  by  a 
transaction  that  is  uncommitted  and  later  aborts.  The  more  recent  transaction 
is  also  aborted.  To  prevent  such  a  situation,  access  to  an  updated  object  by 
other  transactions  is  blocked  until  the  updating  transaction  either  commits  or 
aborts.  This  is  equivalent  to  holding  a  write  lock  on  the  object.  It  should 
be  noted  here  that  the  optimistic  scheme  avoids  locking  and  accepts  the 
possibility of transaction aborts.


3.1.5 Non-Serial Consistency


It is only recently that researchers [GARC83b] [FISIA21 [BLAU83] have
started investigating consistency management techniques that exploit the
semantic knowledge of the database during concurrency. Such knowledge can
lead to certain acceptable schedules that are not serializable. This area of
research is relatively unexplored.

In [GARC83b] Garcia-Molina investigates how the semantic knowledge of an
application can be used in a distributed database to process transactions
efficiently and to avoid some of the delays associated with failures. In
[GARC83b], the main idea is to allow nonserializable schedules which preserve
consistency and which are acceptable to the system users. To produce such
schedules, the transaction processing mechanism receives semantic information
from the users in the form of transaction semantic types, a division of
transactions into steps, compatibility sets, and countersteps. Using these
notions, in [GARC83b] a mechanism is proposed which allows users to exploit
their semantic knowledge in an organized fashion.


3.2  Reliability  Techniques  In  Distributed  Systems 


In  this  section  we  review  various  error  recovery  techniques  and  their 
applicability  in  distributed  systems.  Our  discussion  of  recovery  techniques 
starts  with  a  brief  overview  of  the  concepts  and  definitions  in  this  area. 
Detailed  discussions  of  these  concepts  and  definitions  can  be  found  in  some  of 
the surveys [RAND78] [KOHL81] [VERH78] in this area.

A system is said to have failed when it no longer meets its specifications.
The transition into the failed state is characterized by the failure event.
The term error is used to characterize an incorrect system state such that any
further computation activity using the normal algorithms would result in a
failure of the system. A fault is the mechanical or algorithmic malfunction
(i.e., failure) of a system component that may cause an erroneous state.

All  reliability  techniques  are  based  on  adding  redundancy  in  the  system 
to  support  recovery  from  errors  and  continued  operation.  This  is  called 
protective  redundancy.  It  is  manifested  in  a  system  as  additional  components, 
data,  and  algorithms.  This  section  discusses  the  additional  components,  data, 
and  algorithms  necessary  to  do  error  detection  and  recovery  in  a  distributed 
system. 


3.2.1  Error  Detection  Techniques 


The purpose of error detection techniques is to detect the erroneous
states of the system that could lead to system failures. Some general
techniques for error detection [ANDE79] are described below.

(a) Replication Checks: In such schemes, an activity is replicated and the
results from replicated activities are checked for consistency. An
inconsistency among results indicates a possible error condition. Errors
can be masked by majority voting as in Triple Modular Redundant systems.

(b) Reversal Checks: They involve application of inverse computation to
check what the input to the system should have been. The calculated
input and the actual input are compared for consistency.

(c)  Coding  Checks:  They  are  the  most  popular  error  detection  technique. 
Redundant  information  in  the  form  of  checksum  or  parity  is  associated 
with  objects  to  detect  erroneous  states. 

(d)  Acceptance  Tests/Consistency  Checks:  At  certain  well-defined  points  in 
the  execution,  tests  are  applied  to  the  objects  that  define  the  state  at 
that  point.  Such  tests  ensure  that  the  state  at  that  point  conforms  to 
certain specifications. Any inconsistencies imply an erroneous state.
Consistency  checks  can  also  be  applied  to  some  mutilated  data  structures 
that  are  reconstructed  on  recovery. 

(e) Interface Tests: These tests ensure that the interactions among system
components meet certain acceptance criteria. Tests are applied to the
parameters and the results of interface functions. Such tests limit
propagation  of  errors  from  one  component  to  another  through  the 
interfaces.  The  confinement  of  errors  is  strongly  dependent  on  how 
rigorous  the  acceptance  tests  are.  In  distributed  systems,  interfaces 
provide  well-defined  and  controlled  means  for  the  propagation  of 
exception  conditions  between  modules.  If  the  interface  function 
execution  encounters  error  conditions,  then  an  error  condition  is 
returned  to  the  caller  through  the  interface. 

(f)  Diagnostic  Checks:  In  such  techniques,  explicit  tests  are  conducted  on 
system  components  for  which  expected  outputs  for  given  test  inputs  are 
known.  The  failures  of  components  to  be  tested  and  the  components 
conducting the tests should be independent. As pointed out in [ANDE79],
diagnostic tests are rarely used as a primary error detection mechanism,
but rather as a supplement to other detection mechanisms.

(g) Interval Timer/Time-Out Mechanisms: In distributed systems, time-out
techniques  are  frequently  used  to  detect  possible  error  conditions.  A 
process  invoking  a  remote  operation  waits  for  a  certain  specified  period 
(called  the  time-out  period)  to  receive  the  response.  If  no  response  is 
received  within  this  period,  then  an  exception  condition  is  raised  and 
appropriate  forward  error  recovery  is  initiated. 
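Two of the checks listed above, replication checks (a) and coding checks (c), are easy to illustrate. The following is a minimal sketch under assumed names (majority_vote, attach_checksum, verify_checksum are hypothetical); it uses CRC-32 as the redundant code, which is one choice among many and is not prescribed by the report.

```python
import zlib
from collections import Counter


def majority_vote(results):
    """Replication check: return (value, error_detected) for replicated results."""
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise ValueError("no majority among replicated results")
    # Any disagreement signals a possible error, even when voting masks it.
    return value, votes != len(results)


def attach_checksum(payload: bytes) -> bytes:
    """Coding check: append a CRC-32 of the payload as redundant information."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_checksum(record: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored one."""
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored
```

With three replicas, majority_vote masks a single faulty result while still reporting that a disagreement occurred, which is the behavior described for Triple Modular Redundant systems.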


3.2.2  Error  Recovery  Techniques 


Recovery  techniques  involve  the  generation  of  consistent  system  states. 
There  are  two  categories  of  techniques:  backward  error  recovery  and  forward 
error  recovery.  Backward  error  recovery  techniques  save  prior  consistent 
states in an execution history. When an error is detected, recovery involves
restoring a computation to a saved prior consistent state.
Checkpoint/rollback is typical of these techniques. The forward error
recovery  techniques  use  the  present  computation  and  error  state  to  arrive  at  a 
new  consistent  state.  They  are  typified  by  programming  language  exception 
handlers.  The  techniques  in  the  latter  category  are  application  dependent, 
while  those  in  the  former  are  application  independent.  We  will  restrict  our 
discussion  to  backward  error  recovery  techniques. 


3.2.2.1 Checkpointing and Rollback


In this technique, the state of the process is saved on stable storage
as a checkpoint. A checkpoint is a backup version of the complete execution
environment of the process. When the system is recovering from an error, a
checkpoint of the system (e.g., the state of all processes executing at a
given time) is loaded and restarted.

The  rollback  of  a  process  in  a  system  of  communicating  processes  may 
cause  rollback  of  other  processes.  This  happens  when  a  process  that  is  rolled 
back  to  a  previous  checkpoint  has  communicated  some  information  to  some  other 
processes  after  establishing  that  checkpoint.  Thus,  all  messages  sent  after 
that  checkpoint  are  revoked,  and  all  activities  performed  by  the  recipient 
processes after receiving such messages are invalid; this causes all recipient
processes  to  also  roll  back  to  their  respective  checkpoints  established  before 
receiving  these  messages.  This  can  cause  a  cascade  of  rollback  activities,  a 
phenomenon  referred  to  as  the  domino  effect.  The  domino  effect  can  be  avoided 
if  the  way  that  processes  interact  is  controlled.  One  way  is  to  restrict 
process  interaction  to  accessing  shared  objects  within  the  context  of  a 
transaction  and  the  appropriate  concurrency  control  and  commit  protocols. 
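The cascade described above can be computed mechanically. The sketch below is an illustration only, under assumptions not stated in the report: every process has an initial checkpoint at time 0, all event times are distinct, and the function and parameter names are hypothetical.

```python
def rollback_line(checkpoints, messages, failed, failure_time):
    """Return {process: checkpoint time it must roll back to}.

    checkpoints: {process: list of checkpoint times, including 0}
    messages:    list of (sender, send_time, receiver, receive_time)
    Processes absent from the result do not need to roll back.
    """
    def latest_checkpoint_before(proc, time):
        return max(t for t in checkpoints[proc] if t < time)

    restored = {failed: latest_checkpoint_before(failed, failure_time)}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # A message sent after the sender's restored checkpoint is revoked.
            if sender in restored and sent > restored[sender]:
                # The receiver must return to a checkpoint taken before
                # it received the revoked message.
                target = latest_checkpoint_before(receiver, received)
                if receiver not in restored or target < restored[receiver]:
                    restored[receiver] = target
                    changed = True
    return restored
```

For instance, with checkpoints {"P": [0, 5], "Q": [0, 4], "R": [0, 8]} and messages P to Q (sent 6, received 7) and Q to R (sent 5, received 6), a failure of P at time 9 rolls P back to 5, Q back to 4, and R all the way back to 0.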


3.2.2.2 Careful Replacement


A  key  issue  in  the  development  of  reliable  systems  is  the  saving  of 
consistent  system  states.  The  state  of  a  system  evolves  in  a  number  of 
volatile  main  memory  pages.  At  some  point  in  time,  a  consistent  state  is  to 
be  saved  on  non-volatile  storage.  The  problem  arises  as  to  how  to  update  the 
version  of  the  system  state  on  the  non-volatile  storage  in  such  a  way  that  if 
a crash were to occur in the midst of the update there would be a consistent
system  state  available  on  the  non-volatile  storage  when  recovery  begins. 

The  main  issue  is  how  pages  on  volatile  storage  are  mapped  to  pages  on 
non-volatile storage. There are two possible mappings: direct and indirect.
In  direct  mapping  there  is  a  one-to-one  relationship  between  volatile  and 
non-volatile  storage  pages.  Objects  are  updated  "in  place."  If  a  crash 
occurs  in  the  midst  of  an  update,  an  inconsistent  state  may  exist.  Indirect 
mapping uses techniques that avoid a one-to-one relationship.

The  careful  replacement  technique  updates  a  copy  of  the  original  object. 
The  original  copy,  also  called  the  "shadow"  copy,  remains  unaffected  in  case 
of failures during the updating procedure. Only on commitment is the shadow
copy replaced by the updated copy.

An example of this technique is the scheme proposed by Lampson and
Sturgis  [LAMP81]  for  making  page  write  operations  atomic  in  order  to  implement 
a  stable  storage  facility.  The  Put  and  Get  operations  on  a  physical  disk  are 
not  atomic  in  the  sense  that  a  crash  of  the  system  during  the  put  operation 
for  a  page  may  leave  that  page  only  partially  updated.  A  CarefulPut  operation 
is  defined  to  ensure  that  a  put  operation  completes  successfully  provided  no 
processor  or  disk  crash  occurs.  A  CarefulPut  operation  repeatedly  writes  a 
page  and  reads  it  until  either  it  puts  a  clean  page  or  some  prescribed  bound 
is exceeded. Similarly, a CarefulGet operation reads a page repeatedly until
either  it  gets  a  clean  page  or  some  prescribed  bound  is  exceeded. 

A  Cleanup  operation  periodically  checks  the  status  of  two  pages;  if  one 
of  the  pages  is  corrupted  and  the  other  page  is  in  good  state,  then  the 
cleanup  procedure  replaces  the  contents  of  the  corrupted  page  by  the  contents 
of  the  good  page.  This  operation  is  periodically  applied  to  each  StablePage 
in  the  system.  If  Tc  is  the  period  of  invoking  the  cleanup  procedure,  then 
for  a  StablePage  to  be  reliable  and  highly  available  the  period  Tc  must  be 
small enough so that the probability of both DiskPages of a StablePage getting
corrupted  is  infinitesimally  small. 

A StablePage is constructed from two disk pages by procedures that use the
CarefulGet and CarefulPut operations. A StablePut operation writes a
StablePage by calling CarefulPut to write a main memory page to a disk page
once and then calling CarefulPut to write the same main memory page to a
different disk page. StableGet is defined similarly by using CarefulGet.
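The layering of these operations can be sketched as follows. This is an illustrative toy, not the [LAMP81] implementation: the Disk class, its corrupt test hook, the retry bound MAX_TRIES, and the use of CRC-32 to decide whether a page is "clean" are all assumptions made for the example.

```python
import zlib

MAX_TRIES = 3  # the "prescribed bound" on retries (value chosen arbitrarily)


class Disk:
    """Toy disk: each page holds (data, checksum); Put and Get are not atomic."""

    def __init__(self):
        self.pages = {}

    def put(self, addr, data):
        self.pages[addr] = (data, zlib.crc32(data))

    def get(self, addr):
        """Return (is_clean, data); a missing or damaged page is not clean."""
        data, checksum = self.pages.get(addr, (b"", None))
        return zlib.crc32(data) == checksum, data

    def corrupt(self, addr):
        """Hypothetical test hook: damage a page so that it reads as bad."""
        data, checksum = self.pages[addr]
        self.pages[addr] = (data, checksum ^ 1)


def careful_put(disk, addr, data):
    # Write, read back, and retry until the page reads back clean.
    for _ in range(MAX_TRIES):
        disk.put(addr, data)
        clean, stored = disk.get(addr)
        if clean and stored == data:
            return True
    return False


def careful_get(disk, addr):
    # Read repeatedly until a clean page is obtained or the bound is hit.
    for _ in range(MAX_TRIES):
        clean, data = disk.get(addr)
        if clean:
            return True, data
    return False, None


def stable_put(disk, addr, data):
    # A stable page is the pair of disk pages (addr, 0) and (addr, 1),
    # written one after the other.
    return careful_put(disk, (addr, 0), data) and careful_put(disk, (addr, 1), data)


def stable_get(disk, addr):
    for copy in (0, 1):
        ok, data = careful_get(disk, (addr, copy))
        if ok:
            return data
    raise IOError("both copies of the stable page are bad")


def cleanup(disk, addr):
    # Repair a corrupted copy from the good one.
    ok0, d0 = careful_get(disk, (addr, 0))
    ok1, d1 = careful_get(disk, (addr, 1))
    if ok0 and not ok1:
        careful_put(disk, (addr, 1), d0)
    elif ok1 and not ok0:
        careful_put(disk, (addr, 0), d1)
```

Because the two copies are written in sequence, a crash during either write leaves at least one clean copy for stable_get and cleanup to use.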

Another  example  of  careful  replacement  is  the  use  of  a  shadow  copy  of  an 
object  that  consists  of  multiple  pages  to  facilitate  recovery.  A  current 
version  and  shadow  version  of  the  object  are  maintained.  The  updates  from  an 
uncommitted  transaction  affect  only  the  current  version  of  the  object.  On 
transaction  commitment,  the  current  version  is  made  the  shadow  version, 
thereby  making  the  updates  permanent.  On  transaction  abort,  the  current 
version  is  deleted  and  the  shadow  version  is  made  the  most  current  version. 
The  operation  of  replacing  the  shadow  version  by  the  current  version  must  be 
atomic and done in one instruction. The technique described below [LORI77]
does  this. 

Suppose that an object is represented by a set of StablePages {P1, ..., Pn}
in  the  stable  storage.  The  pages  of  the  object  are  mapped  from  main  memory  to 
disk  via  a  page  table  that  has  one  entry  per  page.  The  old  version  of  the 
object  is  preserved  in  a  shadow  page  table  that  points  to  the  pages  of  the 
shadow version. The current page table is initially set to the shadow page
table  to  facilitate  reading  the  object.  Updates  to  pages  of  the  object  are 
noted  in  the  current  page  table.  When  the  current  version  of  the  object  is  to 
be  written  to  stable  storage,  any  pages  of  the  object  that  have  been  updated 
are  written  to  new  disk  pages  using  StablePut,  the  current  page  table  is 
updated  to  show  the  mapping,  and  then  the  current  page  table  is  written  to 
disk using StablePut. Any crash during the execution of this procedure
before the completion of the last StablePut operation will abort the
transaction. Successful completion of the last StablePut operation implies
permanence of the updates.
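The page-table mechanism just described can be sketched as below. This is a simplified in-memory model with hypothetical names (ShadowPagedObject and its methods); a plain dictionary stands in for stable storage, and the single assignment in commit stands in for the final StablePut of the page table. Reclaiming disk pages that are no longer referenced is left out.

```python
class ShadowPagedObject:
    """Toy shadow-paging store for one multi-page object."""

    def __init__(self, pages):
        self.disk = dict(enumerate(pages))   # disk page number -> contents
        self.next_free = len(pages)
        self.root = list(range(len(pages)))  # shadow page table (committed)
        self.current = list(self.root)       # current page table (working copy)

    def read(self, page_no):
        return self.disk[self.current[page_no]]

    def write(self, page_no, data):
        # Never overwrite a page of the shadow version: use a new disk page
        # and note the new mapping in the current page table only.
        new_disk_page = self.next_free
        self.next_free += 1
        self.disk[new_disk_page] = data
        self.current[page_no] = new_disk_page

    def commit(self):
        # Installing the new table is the single atomic step.
        self.root = list(self.current)

    def abort(self):
        # Discard the current table; the shadow version was never touched.
        self.current = list(self.root)

    def committed_state(self):
        return [self.disk[p] for p in self.root]
```

A crash before commit is equivalent to calling abort: the shadow page table still points at an intact old version.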


3.2.2.3 Logs/Audit Trail


In  this  technique,  actions  performed  on  an  object  are  recorded  in  a  log 
or audit trail. The purpose of the logs is to support either undo of the
logged action for state rollback, or redo of the logged action to ensure
permanence of results produced by committed transactions. Logs/audit trails
are thus used either to restore an object to a state prior to executing a
sequence of operations on it or to ensure the permanence of the effect of
executing a sequence of operations on it. The logs that facilitate object state recovery
record  the  undo  operation  corresponding  to  every  action  performed  on  an 
object,  and  the  logs  that  are  used  to  ensure  permanence  of  effect  record  the 
redo  operation  for  every  operation  performed  on  the  object.  An  undo  record 
for  an  operation  on  an  object  specifies  the  actions  to  be  executed  to  nullify 
the  effect  of  executing  that  operation  on  the  object.  A  redo  record  for  an 
operation  basically  records  the  actions  performed  by  the  operation. 

Logs  that  contain  the  redo  actions  are  called  the  forward  logs,  and  the 
logs  that  record  the  undo  actions  are  called  the  backward  logs.  The  backward 
logs  either  record  the  inverse  operations  or  the  values  of  the  object  before 
the  application  of  the  logged  action.  During  a  recovery  process,  a  backward 
log is used by scanning it backwards for undoing actions in a last-in,
first-out  fashion.  Thus  a  backward  log  can  be  viewed  as  a  push-down  stack. 
During  system  recovery,  a  forward  log  is  scanned  in  the  FIFO  order  as  a  queue. 

A  forward  log  is  said  to  be  idempotent  if  any  number  of  (complete  or 
aborted)  repeated  executions  of  the  log  from  the  beginning  leave  the  updated 
objects  in  the  same  state.  Such  logs  are  also  referred  to  as  intention  lists. 
One  way  to  implement  forward  logs  is  to  use  differential  files.  In  this 
technique,  all  updates  to  an  object  are  recorded  on  a  differential  file.  The 
updates  from  the  differential  file  are  periodically  merged  into  the  main  copy 
of  the  object  and  such  updates  are  then  deleted  from  the  differential  file. 
The  differential  file  technique  provides  a  relatively  inexpensive  means  of 
maintaining multiple versions of a large object. Intention lists and forward
logs  are  forms  of  differential  files  containing  redo  actions  that  record  the 
new  values  of  the  objects  and  have  the  property  of  idempotency.  The  property 
of  idempotency  implies  that  repeated  executions  (some  of  which  may  be 
incomplete)  of  this  sequence  of  actions  would  always  bring  the  updated  object 
to  the  same  state. 
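Idempotency follows directly from recording new values rather than changes. The following minimal sketch (function names and the stop_after crash simulation are hypothetical) replays a forward log in FIFO order and contrasts it with a log of increments, which does not have the property.

```python
def replay(log, store, stop_after=None):
    """Apply redo records (key, new_value) in FIFO order.

    stop_after simulates a crash after that many records were applied.
    """
    for count, (key, new_value) in enumerate(log):
        if stop_after is not None and count >= stop_after:
            break  # simulated crash: the replay is left incomplete
        store[key] = new_value
    return store


def replay_increments(log, store):
    """Counter-example: records of the form (key, delta) are not idempotent."""
    for key, delta in log:
        store[key] += delta
    return store


intentions = [("x", 5), ("y", 7), ("x", 6)]
```

Replaying intentions partially and then completely, or completely any number of times, always leaves x = 6 and y = 7, whereas running replay_increments twice applies every delta twice.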

The  backward  log  technique  is  used  when  changes  are  made  in-place  in  the 
stable  storage.  The  recovery  techniques  based  on  backward  logs  follow  the 
write-ahead-rule:  (1)  Before  performing  an  operation  in-place  on  an  object, 
record  the  corresponding  UNDO  action  in  the  log  and  force  the  log  on  the 
stable  storage;  (2)  Before  committing  a  transaction  (i.e.,  sending  a  commit 
response  to  the  user),  either  the  updated  versions  of  the  objects  or  the 
corresponding  forward  logs  must  be  forced  on  the  stable  storage.  This  rule 
makes  sure  that  if  the  system  crashes  or  the  transaction  aborts,  the  backward 
log  can  provide  a  means  for  restoring  the  object  which  has  been  updated 
in-place. Similarly, for a committed transaction, the updates made by it are
guaranteed  to  be  made  permanent  by  using  the  forward  logs. 
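Part (1) of the write-ahead rule and the last-in, first-out use of a backward log can be sketched as follows. The class and method names are hypothetical, before-values are logged rather than inverse operations, and in-memory structures stand in for stable storage, so forcing the log is not modelled.

```python
class InPlaceStore:
    """Toy store updated in place, protected by a backward (undo) log."""

    def __init__(self, data):
        self.data = dict(data)  # stands in for objects on stable storage
        self.undo_log = []      # stands in for a log forced to stable storage

    def update(self, txn, key, new_value):
        # Rule (1): record the before-value BEFORE changing the object.
        self.undo_log.append((txn, key, self.data[key]))
        self.data[key] = new_value

    def rollback(self, txn):
        # Scan the log backwards (LIFO), restoring before-values.
        for logged_txn, key, before in reversed(self.undo_log):
            if logged_txn == txn:
                self.data[key] = before
        self.undo_log = [r for r in self.undo_log if r[0] != txn]
```

Scanning backwards matters when a transaction updates the same object more than once: the oldest before-value is restored last and therefore wins.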


3.2.2.4 Commit Protocols and Atomic Actions


Commit  protocols  are  used  for  implementing  atomic  actions  in  a 
distributed  system.  The  commit  protocols  enforce  the  atomicity  of 
transactions  in  the  presence  of  node  crashes  and  communication  link  failures. 
The concept of commit protocols was independently introduced by Gray [GRAY79]
and  Lampson  and  Sturgis  [LAMP76]. 

A transaction begins execution at a single node. When an operation is to
be performed on objects at a remote node, a worker process, or cohort, is
initiated at that node. When the operations of the transaction have been
executed, the processes execute a commit protocol to ensure that either all of
the processes decide to commit or all decide to abort the transaction. If the
transaction  commits,  the  updates  are  made  permanent;  otherwise,  the  objects 
are  released  in  the  state  that  they  were  in  prior  to  the  transaction's 
execution.  This  maintains  database  consistency  by  ensuring  the 
"all-or-nothing"  property  of  the  global  transaction. 
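The "all-or-nothing" outcome can be illustrated with a minimal sketch of a centralized two-phase commit in its widely known textbook form. It is not taken from this report, which defers protocol details to the guidebook; the names Cohort, prepare, finish, and two_phase_commit are hypothetical, and failures, log writes, and time-outs are deliberately left out.

```python
class Cohort:
    """A worker process taking part in a transaction at one node."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # Phase 1: vote. A "yes" vote is a promise to commit if told to.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.state = decision


def two_phase_commit(cohorts):
    """Coordinator: commit only if every cohort votes yes."""
    votes = [c.prepare() for c in cohorts]  # collect a vote from every cohort
    decision = "committed" if all(votes) else "aborted"
    for c in cohorts:
        c.finish(decision)
    return decision
```

A cohort that has voted yes but has not yet heard the decision is in the vulnerable state discussed in this section: it can neither commit nor abort on its own if the coordinator fails.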

The  design  of  a  commit  protocol  must  address  a  number  of  issues.  A 
decision  must  be  made  as  to  whether  the  control  of  the  commit  protocol  is  to 
be  centralized  or  decentralized.  If  it  is  centralized,  how  a  commit 
coordinator  is  determined  must  be  defined.  If  it  is  decentralized,  an 
efficient  solution  that  minimizes  messages  must  be  devised.  For  both  cases, 
what actions are taken if a failure occurs can impact system integrity and
performance. If a failure occurs, a commit protocol may or may not cause all
further access to an object to be blocked. Ideally, the period
of time that an object is in a locked state that is vulnerable to a failure
should  be  minimized.  Some  of  the  desirable  characteristics  for  a  commit 
protocol  are:  1)  guaranteed  transaction  atomicity,  2)  minimal  overhead  in 
terms  of  log  writes,  3)  optimized  performance  in  no-failure  case,  4)  ability 
to  "forget"  the  outcome  of  commit  processing  after  a  while,  and 
5)  exploitation  of  read-only  transactions  [MOHA83]. 

A  number  of  commit  protocols  have  been  proposed.  The  most  common  are 
one-phase  and  two-phase.  They  are  both  centralized,  blocking  protocols.  The 
major  difference  between  them  is  the  length  of  time  during  which  a  cohort  is 
vulnerable to the failure of a coordinator. How these and other commit
protocols address the above-mentioned issues is discussed in detail in the
system designers guidebook.


3.2.2.5 Replication Management in Distributed Systems


A  distributed  computer  system  can  offer  benefits  if  objects  are 
replicated  and  their  management  adjusted  to  take  advantage  of  the  multiple 
copies.  The  benefits  can  include  improvements  in  performance  and  reliability. 
The former is possible due to the reduction in communication cost to access an
object and the increase in parallelism of operations on an object. The latter
is  possible  because  operations  can  continue  despite  the  loss  of  system 
components.  For  example,  if  a  directory  is  replicated  on  every  site  on  a 
distributed system, the cost of reading it is the cost of accessing a local
storage device (i.e., there is no overhead incurred due to communication
between two sites). It is possible for users on multiple sites to be
simultaneously accessing the directory, further improving a system's
performance. Finally, if a site fails, the directory can still be accessed by
any  operating  sites. 

Unfortunately,  increases  in  reliability  and  performance  do  not  come  for 
free  and  in  many  cases  are  not  mutually  attainable.  This  tradeoff  in  system 
attributes  is  often  determined  by  a  correctness  criterion  that  describes  a 
relationship  between  the  values  of  the  replicas  of  a  distributed  object  at  any 
point in time. The correctness criterion must ensure that a replication update
algorithm satisfies the mutual consistency property: all replicas of an
object converge to the same state and become identical if update operations
cease.

The  most  common  requirement  of  consistency  has  been  based  on  the  notion 
of serializability of transactions: the effect of the execution of a set of
transactions  is  equivalent  to  some  serial  schedule.  This  is  called  a  strong 
consistency  requirement.  It  requires  that  some  subset  of  the  set  of  copies  of 
an  object  converge  to  a  common  state  within  the  time  it  takes  for  a  single 
transaction's  execution. 

A  different  requirement  for  consistency  may  be  derived  from  observing 
applications  such  as  directories,  calendars,  or  network  resource  tables.  The 
use  of  these  objects  does  not  require  that  they  have  the  most  up-to-date 
information.  For  example,  a  network  name  server  may  access  an  object's  old 
site  and  be  directed  to  its  new  site,  or  a  message  may  be  routed  through  a 
network  over  a  longer  than  optimal  path  because  its  routing  table  is  slightly 
out of date. The services can nevertheless be provided using non-identical copies
of an object. The consistency requirement for them is that they must
eventually converge to a common state if changes to the object stop. This
convergence may span the time it takes for multiple transactions to execute.
This correctness criterion is called weak consistency. It is a property of the
application.
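The convergence requirement can be illustrated with a small sketch. The reconciliation rule used here, keeping the entry with the larger (timestamp, site) pair, is an assumption chosen for the example and is not a mechanism described in this report; the names Replica, merge_from, and exchange are hypothetical, and timestamps are supplied by the caller.

```python
class Replica:
    """A replica storing, per key, a (timestamp, site, value) triple."""

    def __init__(self, site):
        self.site = site
        self.entries = {}

    def update(self, key, value, timestamp):
        # Updates are applied locally without waiting for other replicas.
        self.entries[key] = (timestamp, self.site, value)

    def read(self, key):
        # Reads are local and may return out-of-date information.
        return self.entries[key][2]

    def merge_from(self, other):
        # Keep whichever entry has the larger (timestamp, site) pair.
        for key, entry in other.entries.items():
            if key not in self.entries or entry[:2] > self.entries[key][:2]:
                self.entries[key] = entry


def exchange(a, b):
    """One round of reconciliation between two replicas."""
    a.merge_from(b)
    b.merge_from(a)
```

Between exchanges the two replicas may answer the same query differently, which the applications above tolerate; once updates stop and an exchange has run, the copies are identical.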

A third correctness criterion related to consistency exists. It is called
semantic consistency and is a property of a set of transactions of an
application. Semantic consistency seeks to find relationships (e.g.,
commutativity, inverses, etc.) between the effects of transactions that allow
them to be executed according to a non-serializable schedule. To the best of
our  knowledge,  semantic  consistency  has  not  been  applied  to  performing  updates 
on  replicated  objects.  However,  it  has  been  proposed  as  a  technique  for 
merging  replicated  objects  that  existed  in  different  network  partitions  when 
the  partition  is  repaired. 

In  some  sense,  consistency  criteria  can  be  seen  as  points  on  a  spectrum 
differentiated  by  the  amount  and  type  of  activity  that  may  occur  in  a  system 
at  any  point  in  time.  The  three  consistency  criteria  discussed  are  points  in 
this spectrum that are currently known and are not meant to be interpreted as
the  only  possible  criteria. 

The  problem  of  managing  replicated  objects  can  be  divided  into  four  parts 
—  normal  operation,  detecting  a  failure  and  transitioning  into  a  degraded 
mode  of  operation,  operating  in  a  degraded  mode,  and  merging  partitions  during 
recovery. The first and third parts of the problem are the same problem but in
a  different  operating  environment.  They  are  almost  always  addressed  by  a 
single  mechanism  and  will  be  discussed  as  a  single  problem  in  this  paper. 
Transitioning  into  a  degraded  mode  has  two  subparts  —  termination  and 
recovery.  Termination  is  the  action  taken  by  operational  sites  when  they 
determine that a site has failed and affects a transaction. Recovery is the
action  taken  by  a  site  to  clean  up  any  existing,  uncompleted  transactions  when 
it  becomes  operational  after  previously  failing.  Finally,  merging  is  when  a 
set  of  sites  acts  to  bring  multiple  copies  of  an  object  into  a  consistent 
state.  It  is  helpful  to  recognize  these  distinct  parts  in  order  to  understand 
the  advantages,  disadvantages,  and  applicability  of  the  algorithms  to  be 
discussed. 

A  number  of  algorithms  have  been  designed  to  ensure  that  the  copies  of  an 
object meet some consistency correctness criterion. The system designers
guidebook  discusses  how  some  of  these  algorithms  operate  and  their  effect 
under  normal  and  degraded  operation.  Degraded  operation  exists  when  either  a 
site  goes  down,  a  communication  link  is  lost,  a  network  is  partitioned,  or  a 
message  is  lost  or  duplicated.  Some  update  algorithms  are  tolerant  of  some  of 
these  failures  and  have  no  explicit  distinction  between  normal  and  degraded 
operation.  Other  algorithms  cannot  tolerate  failures  and  may  block  an 
operation  until  recovery  from  the  failure  has  been  completed,  or  abort  the 
operation. 

An  attribute  of  interest  is  availability:  the  probability  that  an  object 
can  be  accessed  and  an  operation  successfully  performed.  Those  algorithms 
that  ensure  weak  consistency  result  in  a  higher  availability  of  objects. 
Strong  consistency  requirements  typically  restrict  the  concurrency  level  to  a 
single update transaction and multiple read-only transactions. They further
restrict access to the replicated object to only one partition during degraded
operation.

There  are  some  general  relationships  among  a  replication  algorithm's 
distribution  of  control,  consistency  criteria,  reliability,  and  performance. 
Centralized  control  supports  strong  consistency  and  freedom  from  deadlock 
well,  but  is  susceptible  to  single  points  of  failure.  It  potentially  can 
create  performance  problems  (bottlenecks)  and  thereby  reduce  the  availability 
of  an  object  under  both  normal  and  degraded  operation.  Decentralized  control 
can  potentially  increase  the  throughput  of  a  system  and  the  tolerance  of  a 
system  to  single  point  failures.  Weak  consistency  is  not  appropriate  for 
centralized  control;  it  is  naturally  achieved  through  decentralized  control. 
Weak consistency improves a site's throughput and response time, an object's
availability, and a system's resilience to multiple failures.


3.2.2.6 Network Partitioning and Continued Operations


Under  the  conditions  of  network  partitioning,  allowing  sites  to  update  a 
replicated  database,  some  copies  of  which  are  in  an  inaccessible  partition, 
may result in inconsistency among the copies. This inconsistency among the
copies must be resolved when the partition is repaired. In [BLAU83] two
schemes,  called  Data  Patch  and  Log  Transformation,  have  been  proposed  for 
integrating  the  inconsistent  copies  of  the  database.  The  technique  called 
Data  Patch  [GARC83a]  relies  on  the  data  values  and  the  semantic  knowledge  of 
the  database.  The  technique  of  Log  Transformation  uses  the  logs  of  the 
transactions  executed  during  network  partitioning  for  the  integration  purpose. 

Data  Patch  is  an  example  of  the  forward  error  recovery  technique.  In 
this  approach  the  data  values  before  the  partition  and  the  data  values  at 
different  sites  after  the  partition  are  examined  during  the  partition  repair 
time. Depending on various criteria and consistency requirements,
the  final  merged  value  of  the  data  is  determined.  The  criteria  and  techniques 
for  determining  the  repaired  values  are  determined  at  the  time  of  the  database 
design;  the  database  administrator  uses  tools  based  on  these  policies  to 
integrate different copies of the database during the partition repair.

The  usefulness  of  the  data-patch  technique  strongly  depends  on  a  thorough 
understanding  of  the  application  environment.  This  technique  fails  to  deal 
with network partitioning during integration. Data-patch allows ad hoc
updates,  but  such  updates  require  restrictions  to  keep  the  integration  rules 
appropriate.  As  new  transaction  types  are  added,  the  integration  rules  must 
be  updated  appropriately.  The  compensating  actions  that  require  only  the 
database  values  at  the  merge  time  are  efficient  to  execute  compared  to  those 
actions  which  require  examination  of  the  execution  logs  to  determine  which 
transactions  generated  these  data  values. 

Log  transformation  relies  on  the  logs  of  transaction  executions  at 
different  sites  for  merging  the  partitioned  copies  of  a  database.  This 
technique  does  not  make  use  of  the  data  values  at  the  merge  time.  It  assumes 
that all transactions are pre-defined, and it requires that a database
administrator specify the semantic properties of, and relationships between,
the transaction types; for example, that transactions T1 and T2 commute, or
that transaction T2 overwrites all data written by T1. The transaction logs
are merged according to these rules and other rules that apply integrity
constraint checks.

During the partition merge time, the execution logs from each partition
are  exchanged  and  new  merge  logs  are  built.  In  constructing  merge  logs  some 
conflicting  transactions  are  undone  and  re-run.  The  merge  logs  are  generated 
independently  at  each  site;  therefore,  they  may  be  different  at  different 
sites.  However,  the  log  transformation  technique  assumes  that  there  is  a 
system-wide  policy  to  re-order  conflicting  transactions.  One  possible 
criterion  for  determining  the  order  of  execution  for  transactions  in  the 
merged  logs  is  their  execution  time. 
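
The ordering step alone can be sketched as follows, assuming that each log entry carries an execution time and a site name and that ties are broken by site name. These field names and the tie-breaking rule are assumptions made for illustration; undoing and re-running conflicting transactions, and the semantic rules supplied by the administrator, are not modelled.

```python
import heapq
from typing import NamedTuple


class LogEntry(NamedTuple):
    time: float   # execution time recorded in the log
    site: str     # site that executed the transaction (tie-breaker)
    txn: str      # pre-defined transaction type


def build_merge_log(*partition_logs):
    """Merge per-partition logs, each already sorted by time, into one log.

    Ordering by (time, site) is the assumed system-wide policy. Because it
    is deterministic, every site applying it to the same logs obtains the
    same order.
    """
    return list(heapq.merge(*partition_logs, key=lambda e: (e.time, e.site)))


log_a = [LogEntry(1.0, "A", "deposit"), LogEntry(4.0, "A", "withdraw")]
log_b = [LogEntry(2.0, "B", "withdraw"), LogEntry(4.0, "B", "deposit")]
merge_log = build_merge_log(log_a, log_b)
print([(e.site, e.txn) for e in merge_log])
```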

The applicability and usefulness of the log-transformation technique
depend on the application. As in the case of data-patch, the transactions in
the system are of pre-defined types. As new transaction types are added to
the system, the information that the technique needs to operate correctly
(their semantic properties and relationships) must be supplied as well.


3.3  Summary 


In  distributed  systems,  the  operating  conditions  that  may  arise  due  to 
concurrency  and  component  failures  strongly  influence  the  consistency 
management techniques. This is shown in Figure 3-1. Concurrency of
operations requires techniques to maintain mutual, internal, and external
consistency requirements. Depending on these consistency requirements,
serializability of transactions may be a necessary requirement for the
consistency management techniques; the consistency requirements have
therefore been divided into two classes: those that require serializability
and those that do not. The concurrency control techniques that ensure
serializability are based on locking, time-stamp, or optimistic protocols.
Most of the work in the area of consistency management has been in the
context of maintaining serial consistency of distributed, replicated or
partitioned databases. Not many researchers have addressed the consistency
requirements that permit non-serializable interleaved executions of
transactions. Applications with
such  consistency  requirements  can  be  important  if  continued  operations  are  to 
be permitted in spite of network partitioning.

The component failures have been divided into two classes: silent
failures  and  malicious  failures.  Silent  failure  of  a  component  means  that  the 
failed  component  does  not  generate  or  forward  any  information.  In  a  malicious 
failure,  the  failed  component  may  generate  wrong  messages  or  distort  the 
messages  it  forwards.  Silent  failures  of  components  affect  the  internal, 
mutual,  and  external  consistency  in  the  system.  The  techniques  for 
maintaining  system  consistency  under  such  failures  are  based  on  the  concept  of 
atomic  actions.  The  problem  of  interactive  consistency  arises  in  the  presence 
of  malicious  failures  of  components.  The  solutions  to  such  problems  are  based 
on the solution to the Byzantine Generals' problem.


CHAPTER 4


ABSTRACT  DISTRIBUTED  SYSTEM  ARCHITECTURE 


The  architectural  features  of  distributed  systems  offer  a  great  potential 
for  designing  reliable  systems  because  the  physical  isolation  between  system 
components  can  reduce  the  correlation  among  component  failures,  and  the 
redundancy  of  resources  can  support  continued  operations  in  the  event  of 
failures.  Reliability  and  consistency  management  techniques  provide  the 
building  blocks  from  which  reliable  distributed  systems  are  built.  However, 
this  potential  has  largely  remained  unexploited  because  of  the  lack  of  a 
formal  discipline  to  integrate  the  existing  and  known  recovery  techniques  into 
the  designs  of  distributed  systems.  This  chapter  presents  an  object-oriented 
design  model  for  distributed  systems  which  facilitates  a  systematic  and 
well-structured  integration  of  known  recovery  and  consistency  management 
techniques  into  the  designs  of  distributed  systems.  We  first  discuss  some  of 
the  techniques  that  can  be  used  for  structuring  systems.  Next  a  design  model 
for  reliable  distributed  systems  is  presented  with  a  discussion  of  how  the 
reliability and consistency management techniques described in Chapter 3 can
be used to implement the functions of the design model. Finally, a system
structure that combines the design model with object-oriented design
techniques  is  presented. 


4.1  System  Structuring  Concepts 

Much  of  the  recent  research  in  reliable  system  design  is  actually 
exploration into system structuring techniques. Distributed systems are
intrinsically more complex than centralized systems. A structured approach
reduces design complexity by factoring the designs into layers that create
different  levels  of  functional  abstraction;  the  design  of  each  layer  can  then 
be  carried  out  somewhat  independently  of  the  design  of  the  other  layers.  The 
layers  in  the  system  can  be  viewed  as  creating  horizontal  partitions  in  the 
system  design. 

Another  structuring  concept,  which  is  dual  as  well  as  orthogonal  to  a 
layered  approach,  is  object-orientation  which  creates  vertical  partitions  in 
the  system.  The  interactions  between  these  partitions  occur  through  some 
well-defined  interfaces;  thus,  each  partition  in  the  system  represents  an 
independent domain whose internal structure cannot be directly accessed by
other domains. A vertical partition essentially embodies the concept of
objects in the system. The whole system is viewed as a collection of objects.
All state transformations in one partition by other partitions are performed
through the interfaces defined by the partition. The advantage of such an
approach is that the design of the internal structure of any given partition
is independent of the designs of other partitions. These are the fundamental
principles of data abstraction. From the viewpoint of reliable system design,
such an approach is very attractive because it supports confinement of errors
within an object boundary. This also implies that the recovery mechanisms for
a given partition can be designed to suit its reliability requirements.

There are two distinct approaches to designing reliable systems. The
traditional  approach  takes  a  process-oriented  view  of  the  system  where  objects 
are  bound  to  the  address  space  of  a  process  at  the  time  of  process  creation 
and  execution.  The  process  is  responsible  for  maintaining  the  integrity  of 
these  objects  in  the  presence  of  faults  and  system  crashes,  and  for  recovering 
its  locus  of  execution  in  the  presence  of  faults.  This  approach  uses 
checkpointing  and  rollback  as  primary  recovery  mechanisms  for  constructing 
resilient  processes.  Most  previous  operating  systems  have  used  system-wide 
checkpointing,  saving  the  state  of  all  processes  in  the  system,  irrespective 
of  need.  The  research  in  this  area  has  addressed  the  problems  of  separately 
checkpointing interacting concurrent processes [KIM79] [RUSS80]. The major
problem  is  to  avoid  a  domino  effect  in  which  the  rollback  of  one  process  may 
lead  to  a  cascade  of  rollbacks. 

A second, more recent approach takes an object-oriented view of systems
[LISK82].  In  this  view,  objects  are  of  distinct  types;  each  type  provides  a 
defined  set  of  externally  visible  operations.  Each  object  is  permanently 
bound  to  the  address  space  of  its  object  manager.  Processes  act  upon  these 
objects  by  invoking  the  visible  operations  implemented  by  an  object  manager. 
The  object  manager  is  responsible  for  enforcing  necessary  concurrency  control 
rules  and  recovering  objects  from  faults  and  system  crashes.  The  primary 
recovery  mechanisms  include  forward/backward  logs,  careful  replacement,  and 
object replication [KOHL81]. Processes are no longer responsible for
recovering  the  objects  they  access  during  their  execution;  however,  they  are 
still  responsible  for  recovering  their  execution  locus.  This  requires 
establishing  recovery  points,  and  rolling  back  a  process  to  some  recovery 
point.  A  major  advantage  of  the  object-oriented  approach  is  the  clean 
separation  between  the  recovery  functions  for  processes  and  objects.  Another 
advantage  is  that,  for  each  object  type,  the  recovery  mechanisms  and  their 
design  parameters  can  be  selected  to  match  the  type's  integrity  requirements. 


4.2  A  Design  Model  For  Reliable  Distributed  Systems 


A  design  model  for  the  construction  of  reliable  distributed  systems  is 
shown in Figure 4-1. This model has been inspired by Lampson's lattice model
[LAMP81a] for reliable distributed systems. The objective of reliable
distributed  system  designs  is  to  synthesize  secure  and  stable  distributed 
objects  that  survive  crashes  of  system  components  and  support  high 
availability  of  functions.  Such  objects  are  constructed  using  unreliable 
resources  such  as  physical  storage  (disks),  physical  processors,  and  the 
communication media. This section describes the design model in a bottom-up
fashion. The properties of each functional abstraction in the model and the
recovery mechanisms that can achieve these properties are identified.

The physical storage refers to non-volatile disk storage that has a
non-zero probability of information loss. For example, a page on a disk may
get corrupted due to a head-crash or some other malfunction. Another problem
with the physical storage is the non-atomicity of write operations on pages.
For example, a crash may occur in the disk system while a new value is being
written on a page. This leaves the page in an undefined state because the old
value has been destroyed and the new value has not been written completely.

The stable storage facility, which is constructed from the physical disk
storage, provides atomic write operations on pages. It enhances the
availability of data by increasing the mean-time-to-failure of a disk page.
Lampson introduced a technique for constructing stable storage from an
unreliable disk storage facility [LAMP81a]. It is based on careful
replacement.

A physical processor loses its control state data on crashes; a restart
operation can only cause a process to execute from the beginning. A
stable processor facility, on the other hand, supports saving of process
states on the stable storage, and restarting a process from some previously
saved process state. The operation of saving process states is called
checkpointing. Processes are considered as objects that are supported by a
stable processor facility.
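
The idea behind careful replacement for stable storage can be sketched as below. This is a simplified illustration and not a reproduction of the algorithm in [LAMP81a]: in-memory tuples stand in for physical disk pages, a CRC stands in for whatever error detection the disk provides, and the class and method names are hypothetical. A checkpointed process state would be stored as the contents of such a page.

```python
import zlib


class StablePage:
    """One logical page kept as two physical copies, each with a checksum."""

    def __init__(self, data=b""):
        self.copies = [self._seal(data), self._seal(data)]

    @staticmethod
    def _seal(data):
        return (zlib.crc32(data), data)

    @staticmethod
    def _valid(copy):
        checksum, data = copy
        return zlib.crc32(data) == checksum

    def write(self, data):
        # Careful replacement: copy 0 is completely written before copy 1
        # is touched, so a crash leaves at least one intact copy.
        self.copies[0] = self._seal(data)
        self.copies[1] = self._seal(data)

    def read(self):
        for copy in self.copies:
            if self._valid(copy):
                return copy[1]
        raise IOError("both copies are damaged")

    def recover(self):
        # After a crash, make both copies equal to the first valid one.
        good = self.read()
        self.copies = [self._seal(good), self._seal(good)]


page = StablePage(b"old")
# Simulate a crash while copy 0 was being overwritten with b"new":
# the checksum belongs to the new value but the data is only partly written.
page.copies[0] = (zlib.crc32(b"new"), b"ne\x00")
page.recover()
print(page.read())  # b'old'
```

In this sketch a write interrupted during the first copy is rolled back to the old value on recovery, while a write interrupted during the second copy is completed; in both cases the page holds either the old or the new value, never a mixture.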

The  next  level  of  abstraction  provides  secure  and  stable  objects  based  on 
stable storage, stable processor and unique identifier (UID) facilities.
Stable objects are those that survive system crashes with a high probability
and for which the primitive operations (i.e., the operations supported by the
type definition) are atomic. Secure objects are protected objects which can
only  be  accessed  by  authorized  users. 

Every object in the system is given a globally unique name using the UID
facility. This name is never reused in the entire life-time of the system.
From this unique identifier, the type of the object can be inferred. The UID
also contains the identification of the node where the object was created.
Objects in the system may migrate from one node to another. The UID facility
defines the logical name space in the system. Operations on an object are
invoked by specifying the UID of the object and the operation name. Because
the type of an object can be determined from the unique identifier of that
object, operation invocation on an object is directed to the appropriate
object manager for that type. The operations on the remote and the local
objects are invoked in an identical fashion. It is for this reason that we
find the remote procedure call paradigm a convenient abstraction.

The  UID  generation  is  based  on  the  stable  storage  and  the  stable 
processor  facilities.  The  UID  generation  facility  is  based  on  a  local  clock 
process or a sequence counter that uses the stable storage to survive system
crashes and to ensure that the same UID is not regenerated on restart of a
node after a crash. The UID for an object indicates the type of the object
and the node where it was created. A scheme for generating UIDs in a reliable
fashion is described in [SCHA83].
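
A sequence-counter variant of such a generator might be sketched as follows. It is not the scheme of [SCHA83]; the UID layout (node, type, serial number), the class names, and the use of an ordinary file in place of stable storage are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import NamedTuple


class UID(NamedTuple):
    node: str        # node where the object was created
    type_name: str   # type of the object, recoverable from the UID
    serial: int      # never reused on this node


class UIDGenerator:
    def __init__(self, node, counter_path):
        self.node = node
        self.counter_path = counter_path   # file standing in for stable storage

    def _load(self):
        try:
            with open(self.counter_path) as f:
                return json.load(f)["next"]
        except FileNotFoundError:
            return 0

    def new_uid(self, type_name):
        serial = self._load()
        # Persist the advanced counter before handing out the UID, so that a
        # crash after this point cannot cause the same serial to be reissued.
        tmp = self.counter_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next": serial + 1}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.counter_path)
        return UID(self.node, type_name, serial)


path = os.path.join(tempfile.mkdtemp(), "uid_counter.json")
first = UIDGenerator("node-7", path).new_uid("Account")
# A fresh generator instance models the node restarting after a crash.
second = UIDGenerator("node-7", path).new_uid("Account")
print(first, second)
```

Because the counter is advanced on storage before the UID is returned, a crash can at worst skip a serial number; it cannot produce a duplicate.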

The abstraction of recoverable objects provides mechanisms to restore the
state of an object after having made some changes to it, or to commit a change
to the object state. The concept of commitment forbids any restoration to
states before commitment. Commitment of a change to an object essentially
implies permanence of the changes made to the object since the last commit
operation on it.

We use the concept of immutable versions to implement mutable recoverable
objects. An immutable object is one that is never changed once it is created,
i.e., every change to an object creates a new object. In our model every
change to an object creates a new version of that object; this version is
uniquely identifiable by using the UID of the object and the version number.
These principles are discussed in [REED78] and [SVOB81].
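
The relationship between immutable versions, restore, and commit can be illustrated with the following minimal sketch. The class and its method names are hypothetical, and the version list is held in memory rather than on stable storage.

```python
class RecoverableObject:
    """Mutable object represented as a sequence of immutable versions."""

    def __init__(self, uid, initial):
        self.uid = uid
        self.versions = [initial]   # version number == index in this list
        self.committed = 0          # highest committed version number

    def update(self, new_state):
        # An existing version is never modified; a new one is appended.
        self.versions.append(new_state)
        return (self.uid, len(self.versions) - 1)   # identifies the version

    def commit(self):
        # After commit, restoration to an earlier state is not possible.
        self.committed = len(self.versions) - 1

    def restore(self):
        # Discard every version created since the last commit.
        del self.versions[self.committed + 1:]

    def current(self):
        return self.versions[-1]


obj = RecoverableObject("uid-1", {"balance": 100})
obj.update({"balance": 80})
obj.commit()
obj.update({"balance": 0})
obj.restore()
print(obj.current())  # {'balance': 80}
```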

Reliability  techniques  most  suitable  for  constructing  recoverable  objects 
include multiple versions, differential files, intention lists, audit
trails/logs, and self-identifying objects. Generally, a combination of
several  of  these  techniques  is  used  in  constructing  recoverable  objects  at  a 
node. 


Maintaining multiple versions as a forward log is less expensive than as
copies of the original object. A forward log in which the sequence of changes
is idempotent can be used as an intention list to ensure the permanence of
results on the commitment of a transaction. Backward logs are used for
restoring objects by undoing the actions recorded in the log. Whenever a new
uncommitted state of an object is to be forced in-place on the stable storage
from the volatile memory, it is essential that (in order to keep the object
recoverable) the backward log be forced on the stable storage before forcing
the uncommitted object in-place on the stable storage.
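
The ordering rule for backward logs can be shown in a few lines. The sketch assumes an in-memory stand-in for stable storage and hypothetical function names; for brevity it also treats a missing object and an object whose state is None alike.

```python
class StableStore:
    """Stand-in for stable storage: an undo log plus in-place object states."""

    def __init__(self):
        self.undo_log = []   # (object id, previous state) records
        self.objects = {}


def force_uncommitted(store, obj_id, new_state):
    # Rule from the text: the backward-log record is forced first ...
    store.undo_log.append((obj_id, store.objects.get(obj_id)))
    # ... and only then is the object overwritten in place.
    store.objects[obj_id] = new_state


def undo_all(store):
    # Undo in reverse order, restoring each recorded previous state.
    while store.undo_log:
        obj_id, old_state = store.undo_log.pop()
        if old_state is None:
            store.objects.pop(obj_id, None)
        else:
            store.objects[obj_id] = old_state


store = StableStore()
store.objects["x"] = 1
force_uncommitted(store, "x", 2)
force_uncommitted(store, "x", 3)
undo_all(store)
print(store.objects)  # {'x': 1}
```

If the two statements in force_uncommitted were reversed, a crash between them would leave an uncommitted state in place with no record of how to undo it.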

Self-identifying objects and consistency checks play an important role
during restart after a crash, in reconstructing objects, object headers and
directories. For example, with multiple versions, additional information
such as the object UID, the state of each version (committed, uncommitted,
commit pending, etc.), and pointers to other versions is incorporated for
crash recovery. After reconstructing the data structures on
crash  recovery,  the  consistency  checks  are  important  in  checking  the  validity 
and  correctness  of  the  reconstructed  data  structures. 

Atomic transactions are implemented using the facilities described above
and  some  concurrency  control  mechanisms.  A  transaction  should  be  atomic  in 
the  presence  of  concurrent  operations  and  system  crashes.  Atomicity  of 
concurrent  transactions  requires  suitable  mechanisms  for  concurrency  control. 
There  are  basically  three  distinct  approaches  to  concurrency  control:  locking 
protocols [ESWA76], time-stamp based schemes [BERN81], and optimistic
techniques [KUNG81]. Recoverable objects support schemes to achieve atomicity
of  transactions  in  the  presence  of  system  crashes.  Transactions  in  our  model 
are  treated  as  objects  of  process  type.  As  in  the  case  of  any  other  object  in 
the  system,  a  transaction  is  assigned  a  UID. 

Nested transactions provide the facility to construct higher levels of
abstractions  by  composing  a  set  of  already  defined  transactions  into  one 
larger  transaction.  The  commitment  of  computations  by  each  of  the  nested 
transactions  is  dependent  on  the  commitment  of  the  parent  transaction. 
Concurrency  control  mechanisms  are  required  to  synchronize  nested  transactions 
of  the  same  or  different  parent  transactions. 
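
The dependence of a nested transaction's commitment on its parent can be sketched as follows. The class is hypothetical, updates are kept as tentative write sets, and concurrency control between sibling transactions is deliberately left out.

```python
class Transaction:
    def __init__(self, store, parent=None):
        self.store = store     # dict updated only by top-level commits
        self.parent = parent
        self.writes = {}       # tentative updates, invisible to outsiders

    def read(self, key):
        txn = self
        while txn is not None:
            if key in txn.writes:
                return txn.writes[key]
            txn = txn.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def begin_nested(self):
        return Transaction(self.store, parent=self)

    def commit(self):
        if self.parent is not None:
            # A nested commit is conditional: the results pass to the parent
            # and become permanent only if the parent itself commits.
            self.parent.writes.update(self.writes)
        else:
            self.store.update(self.writes)
        self.writes = {}

    def abort(self):
        self.writes = {}


store = {"a": 1}
top = Transaction(store)
child = top.begin_nested()
child.write("a", 2)
child.commit()
top.abort()       # the parent aborts, so the child's committed work is dropped
print(store)      # {'a': 1}
```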

The remote procedure call mechanism is based on an atomic transaction
facility to ensure the atomicity of operations in the presence of system
crashes and other concurrent transactions. The remote procedure call
mechanism uses an unreliable datagram facility that supports high probability
of successful delivery of messages. In [SHRI82a] and [LISK82a] arguments are
given in favor of building a reliable remote procedure call facility using
less sophisticated facilities such as a datagram. These are examples of
end-to-end arguments [SALT81] that point out the wasteful duplication of
functions  at  different  levels.  Secure  communication  is  achieved  by  encryption 
of  messages  and  storing  unencrypted  messages  in  protected  buffers. 

Table 4-1 summarizes and restates the design model in terms of design levels
for a system. For each level the faults that can be handled, how they can be
detected, and what error recovery techniques can be used are listed.

Table 4-1. Summary of Application of Recovery Mechanisms

Design Level          Faults Handled          Error Detection         Error Recovery
                      at This Level           Techniques              Techniques
--------------------  ----------------------  ----------------------  -------------------------------
Distributed           o Site Crashes          o Time-out              o Object Replication
Applications and      o Link Failures         o Status Query            - Majority Voting
Distributed Objects   o Lost Messages         o Consistency Checks      - Quorum Based Voting Schemes
                      o Loss of Objects       o Acceptance Tests        - Survivable Set
                        (Processes and Data)  o Diagnostic Tests      o Primary/Stand-by
                      o Network Partitioning                            - Periodic Checkpointing
                      o Software                                        - Reconfiguration
                        Malfunctioning                                o Restart and Retry
                                                                      o Recovery Blocks
                                                                        - Primary/Alternate Blocks
                                                                        - Acceptance Tests
                                                                      o Exception Handling
                                                                      o Salvation Programs

Atomic                o Site Crashes          o Time-out              o Commit Protocols
Transactions          o Memory Failures       o Status Query          o Conditional commitment of
                      o Software              o Interface Test          nested transactions
                        Malfunctioning        o Acceptance Test
                      o Duplicate Messages

Recoverable Objects   o Site Crashes          o Interface Tests       o Backward Error Recovery
(Processes and Data)  o Memory Failures       o Acceptance Tests      o Multiple Versions
                      o Software              o Time-out              o Differential Files
                        Malfunctioning        o Periodic Consistency  o Intention Lists
                      o Loss of Objects         Checks                o Audit Trails/Logs
                                                                      o Self-Identifying Objects
                                                                      o Salvation Programs
                                                                      o Incremental Dumping
                                                                      o Process Checkpoint

Communication Level   o Link Failures         o Time-out              o Retransmissions
(Messages)            o Message Corruption    o Status Query          o Alternate Links and
                      o Lost Messages         o Acks                    Communication Paths
                      o Duplicate Messages    o Checksum/Parity       o Replicated Messages
                                                Checks
                                              o Seq. Numbers

Disk Pages            o Read/Write Errors     o Periodic Consistency  o Careful Writes on Pages
                      o Loss of Disks           Checks                o Replication of Disk Pages
                      o Corruption of Pages   o Checksum/Parity       o Pages Replicated on
                                                Checks                  Multiple Disks

Techniques  dealing  with  network  partitioning,  acceptance  tests,  interface 
tests,  consistency  checks,  and  exception  handling  are  highly  dependent  on  the 
applications. 


4.3  A  Model  Of  An  Object  Oriented  Reliable  Distributed  System 

This section takes the reliability design model presented in the previous
section and integrates it into an object-oriented design. The concept of
object  managers  is  the  basis  for  system  structuring.  An  object  manager 
provides  the  encapsulation  for  a  given  type  of  objects;  all  objects  of  that 
type  are  accessed  or  updated  via  that  object  manager.  In  this  model  the 
construction  of  reliable  distributed  objects  is  based  on  an  atomic  transaction 
facility  and  a  remote  procedure  call  mechanism.  This  approach  is  summarized 
in  Figure  4-2. 

The  lowest  layer  in  this  figure  represents  the  kernel  functions  that 
execute  at  every  host  node  of  the  distributed  system.  Above  the  kernel  layer 
are  the  local  object  management  functions  such  as  storage  management,  access 
control,  synchronization,  and  object  recovery.  This  layer  represents  the 
functions  that  are  associated  with  every  object  manager  in  the  system;  the 
functions  at  this  level  deal  only  with  the  centralized  object  management.  The 
next layer provides the facility of atomic transactions; thus, a sequence of
operations can be performed on a set of objects in an atomic fashion. The
remote procedure call mechanism facilitates operations on objects that are not
local. We have adopted the remote procedure call mechanism because it
provides a uniform way of accessing remote as well as local objects; thus,
the location of the object is transparent to the users during access or update
operations. It is important to make the semantics of remote and local
procedure  calls  identical  in  the  presence  of  host  crashes  and  communication 
link  failures.  In  our  design  we  have  adopted  the  "at  most  once"  execution 
semantics  for  remote  procedure  calls;  thus,  in  the  presence  of  duplicate 
messages  or  on  server  node  crash-restart,  effectively  only  one  execution  of 
the  remote  procedure  will  occur.  The  combination  of  the  remote  procedure  call 
mechanism  with  the  atomic  transaction  facility  is  used  for  managing  objects 
that  are  either  partitioned  or  replicated.  Based  on  these  mechanisms  one  can 
suitably  create  type  definitions  for  replicated  or  partitioned  objects  such 
that  one  can  access  or  update  those  objects  in  the  same  manner  as  updating 
centralized  objects. 
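
One common way to obtain "at most once" behavior, shown here as a sketch rather than as the design adopted in this report, is for the server to remember the reply for each request identifier. The class, the request identifiers, and the counter operation are hypothetical; in a complete design the state change and the saved reply would have to be committed together by the atomic transaction facility.

```python
class Server:
    def __init__(self, stable_replies=None):
        # Replies keyed by request id; assumed to be kept on stable storage
        # so that they survive a crash and restart of the server node.
        self.replies = stable_replies if stable_replies is not None else {}
        self.counter = 0

    def handle(self, request_id, amount):
        if request_id in self.replies:
            # Duplicate or retransmitted request: return the saved reply
            # instead of executing the procedure a second time.
            return self.replies[request_id]
        self.counter += amount
        self.replies[request_id] = self.counter
        return self.counter


server = Server()
print(server.handle("req-1", 5))  # 5
print(server.handle("req-1", 5))  # 5, the duplicate is not re-executed
print(server.handle("req-2", 5))  # 10
```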


DISTRIBUTED OBJECT MANAGEMENT FUNCTIONS
(Partitioned  and  Replicated  Objects) 


RELIABLE  REMOTE  PROCEDURE  CALL  MECHANISM 


ATOMIC  TRANSACTION  FACILITY 


LOCAL  OBJECT  MANAGEMENT  FUNCTIONS 
(Concurrency  Control,  Recovery,  Access  Control, 
Object  Storage  Management) 


KERNEL  FUNCTIONS 

(Host  Resource  Management,  Communication,  Scheduling, 
Remote  Call  Handling,  Interrupt  Handling) 


HARDWARE 


A Model for Reliable Distributed Systems
Figure 4-2


4.3.1 Structure of Object-Oriented Distributed Systems


An  object-oriented  system  consists  of  a  collection  of  Type  Managers  and 
the  objects  created  by  them.  Type  Managers  create  vertical  partitions  in  the 
system.  For  a  given  type  in  the  system,  a  Type  Manager  would  exist  at  all 
those  nodes  which  may  be  required  to  store  objects  of  that  type.  A  Type 
Manager  at  a  node  manages  all  objects  of  that  type  at  that  node.  The  multiple 
instances  of  Type  Managers  for  a  type  function  cooperatively  to  provide  the 
abstraction  of  a  single  Type  Manager  for  that  type  in  the  system.  Each  Type 
Manager  defines  an  address  space  in  which  all  the  objects  of  that  type  reside. 
A  Type  Manager  is  logically  viewed  as  a  single  process  that  performs  all  the 
state  transformations  on  the  objects  in  its  address  space  in  response  to 
execution  requests  by  some  other  objects  of  the  same  or  different  type. 

At  a  physical  node,  several  different  Type  Managers  may  reside,  each 
managing  objects  of  its  type  at  that  node.  The  abstract  machine  to  support 
such  an  object-oriented  system  can  be  constructed  from  almost  any 
hardware/software  system  architecture.  The  system  architecture,  which 
includes  the  hardware,  software,  and  the  firmware  architecture,  of  the 
processors  to  support  such  a  system  must  have:  (i)  a  mechanism  for  switching 
the  processor  between  Type  Managers,  (ii)  a  mechanism  for  partitioning 
secondary memory resources among Type Managers, and (iii) a mechanism for
exchanging  messages  between  Type  Managers. 

It  can  be  seen  from  the  preceding  model  of  Type  Managers  that  there  is  no 
concept  of  a  system-wide  state  or  uniform  control  and/or  recovery  mechanisms. 
Resource  management  functions  and  recovery  mechanisms  are  partitioned  along 
with  the  set  of  Type  Managers.  The  traditional  functions  of  system-wide 
software  units  such  as  operating  systems  and  database  systems  are  incorporated 
into  a  collection  of  Type  Managers  which  implement  the  basic  elements  of  the 
model  of  distributed  computations.  This  is  a  radically  new  view  of  operating 
systems. 

Object  Type  Managers  are  the  primary  building  blocks  for  the  permanent 
elements  of  the  system.  The  Type-Type  Manager  is  an  object  in  the  system  that 
manages  "types"  in  the  system.  It  is  the  means  by  which  new  types  are 
introduced  into  the  system.  The  concept  of  the  Type-Type  Manager  is 
essentially the same as that of the TYPE-TYPE object in the Hydra design
[COHE75].

The objects in the system are accessed in a uniform fashion regardless of
their  locations.  All  operations  on  permanent  objects  are  performed  within  a 
transaction.  A  transaction  is  basically  an  atomic  action  that  is  defined  as  a 
sequence  of  operations  on  local  or  remote  objects.  A  transaction  ensures 
atomicity  of  distributed  operations.  It  is  possible  to  introduce  concurrency 
within a transaction by creating one or more nested parallel transactions.


4.3.2 Functions of the Type Managers


The functional characteristics implemented by the Type Managers are the
original  basis  for  defining  abstract  data  types.  Extending  abstract  data  type 
concepts  to  include  a  formal  basis  for  the  integration  of  recovery, 
synchronization,  and  access  control  mechanisms  generates  a  number  of 
additional  functions  for  the  Type  Managers: 

1. Each Type Manager is directly responsible for mapping the
occurrences of the objects it defines to physical storage.

2.  Each  Type  Manager  implements  access  control  policies  for  the 
occurrences  of  its  type. 

3.  Each  Type  Manager  supports  concurrent  execution  of  its  procedures 
and/or  functions. 

4.  Each  Type  Manager  ensures  the  consistency  of  the  objects  it  stores 
under  concurrent  and  distributed  use. 

5.  Each  Type  Manager  implements  the  necessary  levels  of  redundancy  to 
ensure  the  level  of  fault  tolerance  given  in  its  specification. 

This obviously integrates many functions that have been conventionally
associated  with  database  systems  into  the  object  management  functions  of  this 
operating  system. 


4.3.3  Structure  of  Type  Managers 


Externally  viewed,  a  Type  Manager  is  a  collection  of  functions  and 
procedures  which  can  be  invoked  on  the  objects  of  its  type  by  specifying  the 
identifier  of  the  object  along  with  the  operation  name.  This  causes  an 
invocation  request  message  to  be  sent  to  the  Type  Manager  regardless  of  its 
physical  location  in  the  system.  Internally,  these  operations  are  executed  by 
the  Type  Manager  using  one  or  more  server  processes;  such  server  processes  may 
be  dynamically  created  or  destroyed  by  the  Type  Manager.  The  operations  on 
remote and local objects are invoked by the clients in the same fashion as a
procedure call. Such invocations on remote objects are performed by
implementing remote procedure calls [NELS81] [SHRI82] with "at most once
execution"  semantics.  A  Type  Manager  consists  of: 

-  Data  structures  for  the  objects  of  that  type; 

-  Procedures/functions  defining  the  type; 

-  Concurrency  protocols; 

-  Recovery  mechanisms; 

-  A  database  to  manage  the  objects  in  its  domain; 

-  A  controller  process  that  schedules/executes  the  requests. 
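
The external view of a Type Manager, in which clients name an object by its identifier and an operation by its name, can be sketched as below. The class, the decorator used to register operations, and the Counter type are hypothetical; concurrency protocols, recovery mechanisms, and the controller process are omitted.

```python
class TypeManager:
    """Owns every object of one type and executes operations on them."""

    def __init__(self, type_name):
        self.type_name = type_name
        self.objects = {}      # UID -> object state (the manager's database)
        self.operations = {}   # operation name -> procedure defining the type

    def operation(self, func):
        # Register a procedure as one of the operations of this type.
        self.operations[func.__name__] = func
        return func

    def invoke(self, uid, op_name, *args):
        # Clients name the object and the operation; they never access
        # self.objects directly.
        state = self.objects[uid]
        return self.operations[op_name](state, *args)


counters = TypeManager("Counter")


@counters.operation
def increment(state, amount):
    state["value"] += amount
    return state["value"]


counters.objects["uid-42"] = {"value": 0}
print(counters.invoke("uid-42", "increment", 3))  # 3
```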

A  Type  Manager  is  responsible  for  the  permanent  storage  of  the  object 
instances  of  its  type.  Each  Type  Manager  interfaces  directly  with  some  set  of 
permanent  storage  devices.  The  Type  Manager  generates  the  mapping  from  the 
UID  for  an  object  of  its  type  to  the  physical  storage  on  some  permanent 
storage devices. It also realizes object instantiation in the executable
volatile storage from the permanent storage. There is no system-wide file
system. The object management system takes the place of a file system.


A  Type  Manager  consists  of  a  controller  process  whose  purpose  is  to 
schedule  server  processes  to  serve  client  requests.  The  server  process  is 
given the same UID as that of the client process; thus, a client process is
conceptually  viewed  as  migrating  into  the  address  space  of  the  Type  Manager. 
This  view  of  the  migrating  client  process  is  useful  from  the  viewpoint  of 
enforcing  access  rights  associated  with  the  client  process.  On  the  completion 
of  the  requested  service,  the  server  process  is  deallocated.  The  controller 
process  accepts  the  incoming  or  outgoing  invocation  request  messages,  performs 
security checks, and interfaces with the kernel procedures. Effectively, the
controller  process  plays  the  part  of  a  local  operating  system  for  the  Type 
Manager;  the  scheduling  policies  can  thus  be  tailored  to  the  specific 
requirements  of  the  Type  Managers.  The  controller  process  manages  the  server 
processes  performing  the  operations  and  provides  them  with  a  set  of  procedures 
that  perform  resource  management,  communication,  protection  and  other  services 
that  are  normally  provided  by  an  operating  system. 

A  Type  Manager's  controller  has  several  responsibilities  related  to 
protecting  its  objects  from  unauthorized  access.  Upon  receiving  an  invocation 
request,  the  controller  must  obtain  and  store  the  requesting  process' 
identification.  This  information  is  made  available  to  the  operation  via  a 
callable  procedure  so  that  the  Type  Manager's  controller  may  check  the  access 
list  of  the  object.  In  addition,  the  controller  appends  the  identification  of 
a process which is making an outgoing invocation request to some other Type
Manager.

When  an  incoming  invocation  request  is  received,  the  controller  attempts 
to  locate  the  object  whose  UID  is  given  in  the  request.  First,  the  controller 
looks  for  the  object  in  its  own  local  pool  of  objects.  If  found,  the  program 
which  will  perform  the  operation  on  the  object  is  parameterized  with  the 
object's  local  address  and  then  is  scheduled  as  the  server  process.  If  the 
object  is  not  found  locally,  the  controller  determines  if  a  "forwarding 
address"  has  been  left  for  that  object.  This  might  occur  if  the  object  has 
been  relocated  to  some  other  host.  If  the  object  is  not  found  locally,  the 
controller  sends  a  reply  message  indicating  that  the  object  was  not  found  and 
includes  the  forwarding  address  if  any. 
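
The lookup sequence described above can be sketched as follows. This is only an illustration under stated assumptions, not part of the Zeus design: the names `Controller`, `Reply`, and `handle_request` are hypothetical, and the scheduling of a server process is reduced to a direct function call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Reply:
    found: bool
    result: Any = None
    # Host to which the object was relocated, if a forwarding address was left.
    forwarding_address: Optional[str] = None


@dataclass
class Controller:
    # UID -> object state (the controller's local pool of objects)
    local_objects: Dict[str, Any] = field(default_factory=dict)
    # UID -> host the object has been relocated to
    forwarding: Dict[str, str] = field(default_factory=dict)
    # operation name -> program that performs the operation
    operations: Dict[str, Callable[..., Any]] = field(default_factory=dict)

    def handle_request(self, uid: str, op: str, *args: Any) -> Reply:
        if uid in self.local_objects:
            # Parameterize the program with the local object and run it
            # (this stands in for scheduling a server process).
            result = self.operations[op](self.local_objects[uid], *args)
            return Reply(found=True, result=result)
        # Not found locally: reply "not found", with the forwarding address if any.
        return Reply(found=False, forwarding_address=self.forwarding.get(uid))
```

A client that receives a "not found" reply with a forwarding address would presumably retry the invocation at that host; the retry policy is not specified here.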

In  response  to  an  update  request,  the  Type  Manager  creates  a  new  version 
of  the  object.  This  version  is  committed  only  when  the  transaction  that 
created  it  commits;  the  uncommitted  versions  are  discarded  if  the  transaction 
aborts. 

Each  Type  Manager  maintains  a  database  which  records  the  necessary 
information  pertaining  to  the  objects  in  its  address  space.  This  database 
records  the  identifiers  of  the  objects  of  that  type  currently  present  at  that 
node,  their  physical  addresses,  and  the  commitment  status  of  their  most 
current  versions.  A  Type  Manager  is  also  responsible  for  aborting  a  new 
uncommitted  version  by  timing  out  if  it  detects  no  activity  of  the  transaction 
that created this version. Every time a transaction creates a new version of
an object by invoking an update operation, the Type Manager ensures
that this new version is written onto the stable storage before sending an
acknowledgement  for  the  operation.  A  scheme  for  maintaining  such  multiple 
versions  using  differential  files  is  described  in  Chapter  4. 
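
The version handling described in the last two paragraphs can be illustrated with a small sketch. It assumes a hypothetical `VersionedObject` class, models stable storage as an in-memory list, and takes the current time as an explicit argument; none of these names come from the Zeus design.

```python
from typing import Any, Dict, List, Tuple


class VersionedObject:
    """A committed value plus uncommitted versions keyed by transaction."""

    def __init__(self, value: Any, stable_log: List[Tuple[str, Any]]):
        self.committed = value
        # transaction id -> (uncommitted version, time of last activity)
        self.pending: Dict[str, Tuple[Any, float]] = {}
        self.stable_log = stable_log  # stands in for stable storage

    def update(self, txn: str, new_value: Any, now: float) -> str:
        # The new version reaches "stable storage" before the acknowledgement.
        self.stable_log.append((txn, new_value))
        self.pending[txn] = (new_value, now)
        return "ack"

    def commit(self, txn: str) -> None:
        # The version becomes current only when its transaction commits.
        self.committed, _ = self.pending.pop(txn)

    def abort(self, txn: str) -> None:
        # Uncommitted versions are simply discarded.
        self.pending.pop(txn, None)

    def expire(self, now: float, timeout: float) -> List[str]:
        # Abort versions whose transaction has shown no activity for too long.
        stale = [t for t, (_, last) in self.pending.items() if now - last > timeout]
        for t in stale:
            self.abort(t)
        return stale
```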

Type Managers are responsible for ensuring that each of their defined
operations is atomic. The operation must either complete successfully or else
abort, leaving the object completely unmodified. This is not difficult to
achieve if only local objects are being modified in the operation. However,
if  the  operation  involves  invoking  operations  on  other  Type  Managers,  then  the 
controller  uses  the  transaction  facility  to  ensure  the  atomicity  of  the 
update.  If  the  Type  Manager  is  structured  so  that  operations  may  be  executed 
concurrently,  the  controller  ensures  that  objects  are  not  being  modified  by 
two  operations  simultaneously  or  read  by  one  operation  while  being  modified  by 
another.  Each  type,  in  general,  has  its  own  set  of  constraints  on  the  allowed 
order  of  execution  of  its  operations  on  a  given  object.  These  constraints  are 
supplied  when  the  Type  Manager  is  created. 


4.3.4  Distributed  Types 


The  reason  for  introducing  the  concept  of  distributed  types  in  the  system 
is  to  make  transparent  the  distributed  nature  of  an  object  that  is  logically 
viewed  as  a  single  object.  The  components  of  an  object  may  be  distributed  by 
replication  or  partitioning.  The  transparency  of  the  replicated  or 
partitioned  nature  of  an  object  is  a  convenient  abstraction  which  makes 
updating  and  accessing  of  distributed  and  centralized  objects  identical. 

A  distributed  type  is  an  abstract  data  type  whose  concrete  representation 
is  distributed.  For  example,  an  abstract  type  called  reliable-file  might  be 
implemented  using  physically  distributed  replicated  copies  of  a  file,  or  a 
global  database  might  be  implemented  as  a  set  of  partitioned  distributed 
components.  The  consistency  and  coordination  among  the  distributed  components 
of  the  concrete  representation  is  specified  in  the  type  definition  and 
enforced  by  the  distributed  Type  Manager.  Unlike  the  centralized  objects,  an 
occurrence  of  a  distributed  type  does  not  have  a  unique  host  location,  i.e., 
an  object  of  a  distributed  type  may  "reside"  at  more  than  one  host  for 
reliability and performance reasons. An occurrence of a distributed type is
given a UID; the Type Manager then maps the operations directed to this UID
into a set of operations, which are executed as a transaction, on the
components that comprise the distributed object's concrete representation.
This mapping can be done at any of the hosts where the distributed object is
conceptually "residing". The operations defined for a distributed type are
implemented as transactions.
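
As an illustration of this mapping, the sketch below implements a write on the reliable-file example as an all-or-nothing update of every replica. The write-all policy and the names `Replica` and `write_reliable_file` are assumptions made for the example; a distributed type may equally well specify a different coordination rule in its type definition.

```python
from typing import Any, List


class Replica:
    """One physical copy of the logical file."""

    def __init__(self, name: str, available: bool = True):
        self.name = name
        self.available = available
        self.value: Any = None
        self.pending: Any = None

    def prepare(self, value: Any) -> bool:
        # Create an uncommitted version; fail if the host is unreachable.
        if not self.available:
            return False
        self.pending = value
        return True

    def commit(self) -> None:
        self.value, self.pending = self.pending, None

    def abort(self) -> None:
        self.pending = None


def write_reliable_file(replicas: List[Replica], value: Any) -> bool:
    """Map one logical write onto all replicas as a single atomic action."""
    prepared: List[Replica] = []
    for replica in replicas:
        if replica.prepare(value):
            prepared.append(replica)
        else:
            # One component failed: undo the others and abort the whole write.
            for p in prepared:
                p.abort()
            return False
    for replica in replicas:
        replica.commit()
    return True
```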


4.4 Summary

The  system  designers  guidebook  presents  an  object-oriented  design  model 
that  supports  structuring  of  distributed  systems  for  high  reliability  and 
error  recovery.  In  this  model,  we  identify  the  error  recovery  problems  at  the 
different levels of functional abstraction and show how various error recovery
techniques are integrated into this design model. For example, techniques
based on multiple versions, logs, careful replacement, and differential files
are used for constructing recoverable objects and for checkpointing. Commitment
techniques are used for constructing atomic transactions, and the techniques
based on replication and primary-backup modes of operation are used for
constructing reliable distributed objects. The use of this model in the
design of an actual distributed operating system is the topic of the next
chapter.


CHAPTER 5


ZEUS:  AN  EXAMPLE  SYSTEM  DESIGN 


The  previous  chapter  presented  several  recovery  mechanisms  and  a  design 
model  for  constructing  reliable  distributed  systems.  This  design  model 
provides  a  framework  for  integrating  the  recovery  mechanisms  into  a  system 
design  in  a  structural  fashion.  Ideally,  a  distributed  operating  system 
should  make  the  low  level  recovery  mechanisms,  such  as  logs  and  commit 
protocols,  transparent  to  application  programmers  by  providing  some  high-level 
functions  for  constructing  reliable  software.  This  chapter  describes  a 
distributed  operating  system  called  Zeus  which  has  been  designed  with  this  in 
mind.  The  design  illustrates  how  various  recovery  mechanisms  are  integrated 
according  to  the  design  model  presented  in  the  previous  chapter. 

This  example  design  should  be  viewed  as  a  framework  for  integrating 
recovery  mechanisms  into  distributed  system  designs  rather  than  a  point 
solution.  As  mentioned  earlier,  the  approach  to  the  development  of  the  system 
designers  guidebook  is  example-driven.  This  approach  consists  of  designing  an 
example  system  which  illustrates  the  structuring  principles  as  well  as  the 
formal  design  definition  methods  for  reliable  system  designs.  Additionally, 
the same example design is used to illustrate the application of design
analysis and verification methods to reliable distributed systems to analyze
their performance, reliability, and functional correctness.

This  chapter  presents  the  principles  followed  in  designing  Zeus,  an 
object-oriented  distributed  operating  system  for  integrating  recovery 
mechanisms  into  the  designs  of  distributed  command  and  control  systems.  The 
main  contribution  of  this  work  is  an  operating  system  design  that  provides  an 
integrated  set  of  functions  to  application  programmers  for  reliable  management 
of  objects  in  distributed  systems.  These  functions  transparently  provide 
complex  recovery  mechanisms,  commit  protocols,  concurrency  control  mechanisms 
[KOHL81] [BERN81], and remote object accessing to application programmers.
For  now  the  primary  goal  of  the  Zeus  design  is  to  define  reliable  object 
management  functions  for  distributed  command  and  control  systems  and  to 
evaluate  the  performance  and  the  correctness  of  the  recovery  mechanisms  for 
these  functions;  therefore,  no  implementation  of  the  design  exists  at  this 
stage.  The  user  visible  functions  support  definition  of  object  types, 
creation  of  objects,  and  updating  of  distributed  objects  using  atomic 
transactions. 

A distributed operating system for high reliability applications must not
only include suitable recovery mechanisms that are transparent to the
application developers but should also provide transparency of the
distributed nature of the system. The second feature is important to make
development  of  distributed  software  no  more  difficult  than  the  development  of 
conventional  software  systems.  The  Zeus  design  has  made  a  significant 
contribution  in  this  direction.  Some  other  systems,  such  as  LOCUS  [WALK83], 
have  integrated  these  two  concepts  in  their  designs;  however,  in  most  of  these 
systems,  object  management  is  limited  only  to  the  file  storage  level.  To 
date,  Argus  [LISK82]  is  the  only  other  system  besides  Zeus  which  provides  a 
set  of  general  mechanisms  for  reliable  management  of  distributed  objects  of 
any  type.  Zeus  not  only  provides  such  general  mechanisms,  but  also  addresses 
several other issues not included in the Argus design such as object naming,
object relocation, authentication and object protection. We have made an
effort to address these issues in the Zeus design, making it novel as compared
to other distributed operating systems. Another novel feature is the
integration  of  the  conventional  database  management  functions  into  the 
operating  system  object  management  functions.  This  is  an  important  advance  in 
the  operating  system  designs  because  most  of  the  current  popular  operating 
systems  do  not  provide  efficient  mechanisms  for  database  applications 
[STON81]. Even with respect to its recovery model, the Zeus design differs
significantly  from  other  known  designs. 


The concept of object-oriented design has been used in some recent
distributed system designs such as Cronus [SCHA83], SWALLOW [SVOB81], Argus
[LISK82], and in the approach presented in [SHRI81]. Argus provides
object-oriented linguistic mechanisms for constructing reliable distributed
systems, and SWALLOW provides reliable object management. These systems do
not support some of the other operating system functions such as access
control, naming, sharing, and resource management. Some of the functions
supported by Zeus, such as naming, authentication, and interprocess
communication, can be found in Grapevine [BIRR82]. Grapevine cannot be
regarded as a general purpose distributed operating system because it is
intended only to support a distributed mail system.

The  design  of  the  Cronus  operating  system  has  significantly  influenced 
the  design  of  Zeus,  largely  because  both  these  systems  are  intended  for  highly 
reliable  applications  such  as  command  and  control  systems.  Zeus  provides 
users  with  reliable  object  management,  which  is  not  present  in  the  current 
design  of  the  Cronus  system.  Like  Cronus,  Zeus  has  the  character  of  a  general 
purpose  operating  system  mainly  because  the  nature  of  the  command  and  control 
applications  includes  a  wide  range  of  processing  characteristics.  This  is  in 
sharp  contrast  to  the  requirements  for  banking  or  airline  reservation  systems 
where  the  application  environment  is  well-defined.  Zeus  provides  capabilities 
for  defining  and  creating  objects  and  transactions  required  by  the  application 
systems.  It  also  provides  mechanisms  that  support  management  of  such  objects 
in  a  reliable  fashion.  Zeus  can  be  used  for  constructing  any  high  reliability 
application  system. 

This  chapter  presents  the  basic  object-oriented  building  block  mechanisms 
provided  by  the  Zeus  distributed  operating  system.  The  concept  of  object 
managers  is  the  basis  for  system  structuring.  An  object  manager  provides  the 
encapsulation  for  a  given  type  of  object;  all  objects  of  that  type  are 
accessed or updated via that object manager. The object-oriented recovery
model underlying the Zeus design is described in Chapter 4. In this model the
construction  of  reliable  distributed  objects  is  based  on  an  atomic  transaction 
facility  and  a  remote  procedure  call  mechanism. 

The  object  management  model  used  in  the  Zeus  design  is  based  on  the 
concepts  developed  in  the  Hydra  [COHE75]  design.  There  are  some  obvious 
differences  between  the  protection  models  used  in  the  Hydra  and  Zeus  designs. 
The protection mechanism in the Zeus design is based on access control lists
while  the  Hydra  model  is  capability  based.  Although  both  these  models  are 
equivalent  in  terms  of  their  functionality,  they  differ  with  respect  to  their 
operational  environment.  The  prime  reason  for  using  the  access  control  list 
model  in  our  design  is  to  be  able  to  change  the  access  rights  dynamically. 
Although  it  is  not  very  efficient  to  change  access  rights  dynamically  in  a 
capability based system, it is important in a command and control system where
some  of  the  nodes  might  be  taken  over  by  hostile  forces. 

5.1 Structure of the Zeus System

Zeus  is  essentially  a  collection  of  Type  Managers  (TMs);  typically,  many 
different  Type  Managers  coexist  on  a  host  node.  The  core  of  the  operating 
system  consists  of  a  set  of  Type  Managers  that  support  capabilities  for 
defining  new  types  and  object  instances  in  the  system,  authentication  of 
users,  naming  environment  for  each  user,  and  reliable  process  and  transaction 
management  functions.  These  system-defined  Type  Managers  reside  at  every  node 
in  the  system.  The  lowest  level  of  operating  system  at  each  node  is  called 
the  kernel;  the  kernel  virtualizes  the  resources  at  the  host  so  that  each  Type 
Manager  can  be  viewed  as  having  its  own  virtual  processor.  The  kernel 
supports  interprocess  communication,  primary  storage  management,  processor 
scheduling,  interfaces  to  secondary  storage  devices,  and  UID  generation;  it 
also  handles  all  interrupts  due  to  storage  devices  and  the  communication 
devices.  Figure  5-1  shows  the  major  components  of  the  Zeus  system. 

5.1.1  Structure  of  the  Zeus  Kernel 

The  Zeus  kernel  provides  low  level  services  to  the  Type  Managers  of  the 
system. These services include three important functions: 1) interprocess
communication, 2) storage management, and 3) unique identifier (UID)
generation.  The  UID  generation  in  turn  depends  on  the  failure  detection  and 
recovery  of  hosts  in  the  Zeus  system.  The  kernel  consists  of  a  task 
dispatcher  and  a  number  of  interrupt  handlers.  The  task  dispatcher  schedules 
the  different  Type  Managers  at  its  host  node  and  handles  their  requests  for 
resources.  It  also  handles  the  restart  of  the  system  and  initiation  of  the 
Type  Managers.  The  resources  managed  by  the  kernel  include  volatile  and 
non-volatile  storage,  the  processor  and  the  communication  handler.  The  kernel 
interface consists primarily of three parts: invocation requests to other
Type  Managers,  requests  for  unique  numbers,  and  requests  for  resources. 

Interprocess  communication  is  achieved  by  the  mechanism  of  remote 
procedure  call  (RPC)  which  consists  of  four  messages  interchanged  between 
caller  and  callee.  These  are  call,  call  acknowledge,  response  and  response 
acknowledge. For each call that is made from or to a Type Manager, the call
parameters and status must be stored; each Type Manager has a call handler
that performs this function. The synchronous nature
of the RPC is achieved by the Type Managers, which first issue a call and
then, on getting the response, inform the caller of it.

The storage functions of the kernel are performed at the object level;
thus, calls to the kernel can retrieve, store and delete objects. Further,
stable storage operations can be executed by the kernel, where stable storage
is implemented using the Lampson [LAMP81] scheme. Storage management in the
kernel is minimal. Storage is available in fixed-size blocks, and the Type
Managers request one or more of these blocks at any time. A Type Manager is
solely responsible for the data it writes to the blocks of storage. The
kernel keeps track of the ownership of blocks of storage. The routing of
invocation requests to Type Managers is the major function of the kernel.
Each call is an operation invoked against an object that is held by some Type
Manager. The Operation Switch, which is a component of the kernel, supports
this function.

UID generation is a function used by the RPC and by the Type Managers so
that calls and objects can be uniquely identified. This function must
continue despite failure and recovery of hosts. To achieve this, the hosts
participate in a distributed computation to keep track of active hosts and to
let new or recovered hosts join in the UID generation function.

The function of the Operation Switch is to forward an invocation request
to the appropriate Type Manager at a local or a remote node. These calls may
be from a Type Manager or from the network driver. Each call contains the
following information:

1. The extended UID of the object against which the call is invoked.

2. The extended UID of the process invoking the operation.

3. The extended UID of the principal on whose behalf the operation is
being invoked.

4. The operation and a set of parameters.


The Operation Switch uses the host hint field of the target object's
extended UID to determine whether the object is on the local host or not. If it is,
it uses the type unique number of the object to direct the call to the proper
Type Manager. If the object is on another host, the Operation Switch
instructs the Network Handler to send the call to the other host.
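
The routing decision can be sketched as below. The field layout of `ExtendedUID` (host hint, type unique number, object unique number) is an assumption based on the fields named in the text, and `OperationSwitch`, `Call`, and `send_remote` are hypothetical names; the Network Handler is represented by a plain callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ExtendedUID:
    host_hint: str      # host where the object is believed to reside
    type_number: int    # unique number of the object's type
    object_number: int  # unique number of the object itself


@dataclass(frozen=True)
class Call:
    target: ExtendedUID     # object against which the call is invoked
    process: ExtendedUID    # process invoking the operation
    principal: ExtendedUID  # principal on whose behalf it is invoked
    operation: str
    params: Tuple[Any, ...] = ()


class OperationSwitch:
    def __init__(self, host: str, send_remote: Callable[[str, Call], Any]):
        self.host = host
        self.send_remote = send_remote  # stands in for the Network Handler
        # type unique number -> entry point of the local Type Manager
        self.type_managers: Dict[int, Callable[[Call], Any]] = {}

    def route(self, call: Call) -> Any:
        if call.target.host_hint == self.host:
            # Local object: the type number selects the Type Manager.
            return self.type_managers[call.target.type_number](call)
        # Remote object: hand the call to the network for the hinted host.
        return self.send_remote(call.target.host_hint, call)
```

Because the host field is only a hint, the receiving Type Manager may still answer that the object is not there; the sketch does not model that case.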

5.1.2 System-Defined Type Managers

As mentioned previously, Zeus is a set of Type Managers whose members may
potentially change dynamically as Type Managers are created, deleted, and
modified. There is, however, a subset of Type Managers called the System Type
Managers which perform the essential services provided by the kernel of a
conventional operating system. In this section, the Type Managers for these
system types are defined. The following are the System Type Managers which
exist at each node in the system:

(1) Type-Type Manager

(2)  Process/Transaction  Manager 

(3)  Principal  and  Authentication  Manager 

(4) Symbolic Name Manager

(5)  Program  Type  Manager 

(6)  Message  Type  Manager 

The definitions of new Type Managers are introduced in the system by using
the mechanisms supported by a system-wide object called the Type-Type Manager;
thus, the Type-Type Manager implements functions to create, alter, delete and
replicate Type Managers. The definition of the Type-Type object given here is
an adaptation and extension of the Type-Type concepts originating in the HYDRA
[WULF81] operating system. The facilities provided by the Type-Type Manager
include an explicit command for locating the copies of a Type Manager.

The Process/Transaction Manager provides the reliable management of
processes and their operations in the system. The atomic action facility,
called transaction, forms the basic mechanism for building reliable
applications including management of distributed objects.

The Symbolic Name Manager and the Message Type Manager can be regarded as
applications built using the Process Manager functions. The Symbolic Name
Manager maintains the name contexts for the clients in the system. Thus, a
client can use string names instead of UIDs for accessing objects; the
Symbolic Name Manager translates the string names to object UIDs depending on
the context of their use. The Message Type Manager supports message
communication among the clients. The Program Type Manager supports building
executable program objects from a set of specified code segments. It has the
conventional functions of a linker and loader. The Principal and Access
Control Manager has the function of associating appropriate access rights with
the processes in the system which carry out operations on behalf of system
users.

5.1.3 Process Management

Processes are active objects that perform state changes on behalf of
system users by modifying shared permanent objects. They have a (system
defined) type, PROCESS, and are managed by a Type Manager called the Process
Manager. Transactions are PROCESS objects with the additional property of
atomicity.  Atomicity,  or  the  "all  or  nothing"  property,  means  that  either  all 
or  none  of  a  transaction's  updates  become  permanent.  The  TRANSACTION  type  is 
derived  from  the  PROCESS  type;  thus,  all  operations  defined  on  processes  are 
applicable  to  transactions.  Additional  operations  are  defined  for  transaction 
objects.  Shared  objects  can  be  updated  reliably  only  by  transactions. 

Reliable  applications  are  built  in  Zeus  by  manipulating  objects  using 
transactions.  The  Zeus  kernel  offers  only  unreliable  remote  procedure  calls 
[NELS81] [LAMP81b] which are made reliable by invoking them within a
transaction.  The  transaction  facility  also  provides  a  powerful  mechanism  for 
managing  replicated  or  partitioned  objects  reliably.  This  section  presents 
the  computational  model  for  managing  processes  and  transactions,  and  the 
application  visible  operations.  These  operations  are  summarized  in  Table  5-1. 

The  Zeus  design  uses  the  transaction  concept  for  reliability  and  to  avoid 
the  domino  effect  during  process  rollback  by  enforcing  disciplined 
interactions  among  processes.  First,  all  information-flow  among  processes 
which  affects  global  state  takes  place  via  shared  objects.  Second,  all  shared 
global  objects  must  be  accessed  within  a  transaction.  A  transaction  defines  a 
"sphere  of  control"  (SOC)  [DAVI73];  all  objects  modified  by  a  transaction  are 
said  to  belong  to  its  sphere  of  control.  Third,  no  other  process/transaction 
is  allowed  to  access  objects  belonging  to  a  transaction's  sphere  of  control 
during  that  transaction's  execution.  If  the  transaction  completes 
successfully,  the  updated  objects  are  "committed";  otherwise,  they  are 
restored  to  their  state  before  the  transaction  began  execution. 
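
The three rules and the sphere of control can be illustrated with a small in-memory sketch. The names `ObjectStore` and `SphereConflict` are hypothetical, and a conflicting access is modeled by raising an exception rather than by blocking the second transaction.

```python
from typing import Any, Dict, Tuple


class SphereConflict(Exception):
    """Raised when an object belongs to another transaction's sphere of control."""


class ObjectStore:
    def __init__(self, objects: Dict[str, Any]):
        self.objects = dict(objects)
        self.owner: Dict[str, str] = {}               # object -> owning transaction
        self.before: Dict[Tuple[str, str], Any] = {}  # (txn, object) -> prior state

    def _check(self, txn: str, name: str) -> None:
        holder = self.owner.get(name)
        if holder is not None and holder != txn:
            raise SphereConflict(f"{name} is in the sphere of control of {holder}")

    def read(self, txn: str, name: str) -> Any:
        self._check(txn, name)
        return self.objects[name]

    def write(self, txn: str, name: str, value: Any) -> None:
        self._check(txn, name)
        if name not in self.owner:
            # First modification: the object joins the sphere of control
            # and its prior state is remembered for a possible restore.
            self.owner[name] = txn
            self.before[(txn, name)] = self.objects[name]
        self.objects[name] = value

    def finish(self, txn: str, success: bool) -> None:
        # Commit keeps the updates; failure restores the pre-transaction state.
        for name in [n for n, t in self.owner.items() if t == txn]:
            old = self.before.pop((txn, name))
            if not success:
                self.objects[name] = old
            del self.owner[name]
```

Because no other transaction can observe an object inside an active sphere of control, aborting one transaction never forces the rollback of another, which is how the domino effect is avoided.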

A  process  can  create  sequential  and  concurrent  transactions.  The  parent 
process  of  a  sequential  transaction  is  suspended  until  that  transaction 
terminates.  However,  the  parent  process  of  a  concurrent  transaction  process 
executes  concurrently  with  its  child.  When  a  concurrent  transaction  process 
terminates,  an  appropriate  condition  is  signaled  to  the  parent  process. 

The  Zeus  design  allows  a  transaction  to  invoke  other  (sequential  and 
concurrent) transactions, called nested transactions [MOSS81] [RIES82]. A
top-level transaction is one whose parent is a non-transaction type process.
Zeus supports nested transactions (i) to introduce concurrency into an atomic
action,  and  (ii)  to  allow  a  transaction  to  invoke  procedures  which  may  contain 
transactions.  Nested  transactions  also  provide  a  means  for  constructing 
recovery  blocks  [HORN74],  and  updating  replicated  objects  using  majority 
consensus [THOM79] or weighted voting [GIFF79].

Table 5-1: Application Visible Operations

OPERATION                  REMARKS
-------------------------  --------------------------------------------------
INVOKE                     input parameters: UID(1) of object to which
                           operation is applied, operation name and operation
                           parameters
CREATE_PROCESS             if successful, returns UID of the new process;
                           otherwise, returns error signal
DELETE_PROCESS             input parameter: UID of process to be deleted
BEGIN_TRANSACTION          if successful, returns UID of the new sequential
                           transaction; otherwise, returns error signal
CREATE_TRANSACTION         if successful, returns UID of the new concurrent
                           transaction; otherwise, returns error signal
END_TRANSACTION            initiates commit protocol between Process Manager
                           and Object Managers
WAIT                       input parameters: transaction UID(s) on which
                           parent waits, optional timeout value
COMMIT                     invoked by processes only; input parameter:
                           transaction UID
ABORT                      cancels all of a transaction's updates
ESTABLISH_RECOVERY_POINT   returns recovery point number
DISCARD_RECOVERY_POINT     input parameter: recovery point number
ROLLBACK                   input parameter: recovery point number; without
                           parameter, process rolls back to most recent
                           recovery point

To create a sequential transaction, a parent process or transaction
invokes the BEGIN_TRANSACTION function. The parent process is then suspended
until its child terminates. The sequential transaction created by
BEGIN_TRANSACTION inherits its parent's address space and runtime environment.
The transaction terminates by executing either END_TRANSACTION or ABORT.
Invoking END_TRANSACTION causes the Process Manager to execute commit
protocols with the Type Managers (also called object managers) of the objects
accessed by the transaction. The code between a BEGIN_TRANSACTION and a
corresponding END_TRANSACTION is executed as an atomic action, that is, as a
transaction.

(1) A UID is a globally Unique Identifier; every process, transaction and
object in the system has a UID.

A process or transaction creates a concurrent transaction by invoking
CREATE_TRANSACTION. When a concurrent transaction completes, a condition is
signalled to its parent. At this point the child transaction is still not
committed; it is either in the aborted or the commit-pending state. The
commit-pending state indicates that the transaction was successful and is
waiting for its parent to issue a commit command. If the parent is a
non-transaction process, then it explicitly issues the COMMIT command. Nested
transactions are implicitly committed when the top-level transaction commits.
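
The state changes described above can be sketched as a small state machine. The `ACTIVE` state, the class names, and the rule that an explicit commit is rejected on a nested transaction are assumptions introduced for the example.

```python
from enum import Enum
from typing import List, Optional


class State(Enum):
    ACTIVE = "active"
    COMMIT_PENDING = "commit-pending"
    ABORTED = "aborted"
    COMMITTED = "committed"


class Transaction:
    def __init__(self, name: str, parent: Optional["Transaction"] = None):
        self.name = name
        self.parent = parent  # None models a non-transaction parent process
        self.children: List["Transaction"] = []
        self.state = State.ACTIVE
        if parent is not None:
            parent.children.append(self)

    def complete(self, success: bool) -> None:
        # Completion signals the parent; the transaction is not yet committed.
        self.state = State.COMMIT_PENDING if success else State.ABORTED

    def commit(self) -> None:
        """Explicit COMMIT, issued by a non-transaction parent process."""
        if self.parent is not None:
            raise RuntimeError("nested transactions commit with their top-level ancestor")
        if self.state is not State.COMMIT_PENDING:
            raise RuntimeError("only a commit-pending transaction can be committed")
        self._commit_subtree()

    def _commit_subtree(self) -> None:
        self.state = State.COMMITTED
        for child in self.children:
            # Commit-pending descendants are committed implicitly.
            if child.state is State.COMMIT_PENDING:
                child._commit_subtree()
```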

A parent process or transaction can wait for a completion signal from a
concurrent child transaction by invoking the WAIT function. WAIT can include
a time-out option, which causes the invoker to be suspended until either the
transaction completes or an interval of time passes. A process may wait on
any of several transactions, or until each of a set of transactions has
completed.
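
The "any", "all", and time-out variants of WAIT resemble waiting on a set of futures. The sketch below uses Python's standard `concurrent.futures.wait` purely as an analogy; the wrapper `wait_for` and its `mode` argument are hypothetical and are not the Zeus interface.

```python
import concurrent.futures as cf
import time


def wait_for(transactions, mode="all", timeout=None):
    """Suspend the caller until any/all of the given futures complete or time runs out.

    Returns a (done, not_done) pair of sets, as concurrent.futures.wait does.
    """
    when = cf.ALL_COMPLETED if mode == "all" else cf.FIRST_COMPLETED
    return cf.wait(transactions, timeout=timeout, return_when=when)
```

With a time-out, the caller resumes even if no transaction has completed; the unfinished ones are then reported in the second set.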

Processes  and  transactions  perform  operations  on  shared  global  objects 
using  the  INVOKE  function.  Remote  and  local  shared  global  objects  are 
accessed identically.

A top-level transaction may make a commit decision based on the status of
its nested transactions (e.g., completed or aborted). It is undesirable to
require a top-level transaction to revalidate the state of the objects
accessed by a completed nested transaction if the top-level transaction
decides to commit. Revalidation of object states can be avoided if a nested
transaction follows an appropriate commit protocol. A nested transaction can
follow either a one-phase or two-phase commit protocol [BALT81] with its
parent. Using a one-phase commit protocol means that, when a nested
transaction completes, all the objects it modified are in the commit-pending
state. The commit-pending versions cannot be aborted unilaterally by a Type
Manager. In contrast, following a two-phase commit protocol leaves modified
objects in the uncommitted state. Such uncommitted versions can be aborted
unilaterally by their Type Managers, thereby aborting that nested transaction.

Zeus  uses  the  one-phase  commit  option  for  nested  transactions.  This 
allows  the  use,  within  a  transaction,  of  a  conditional  statement  that  depends 
on  the  successful  completion  of  one  of  the  transaction's  nested  children. 
Such conditionals may be used because the one-phase commit option prevents a
unilateral  abort  by  an  object  manager  from  invalidating  conditional  decisions 
made  by  the  parent  transaction.  It  also  eliminates  the  need  for  a  parent 
transaction  to  revalidate  the  status  of  completed  nested  transactions. 
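
The difference between the two options can be reduced to one question: in
which state does a completed nested transaction leave the object versions it
modified, and may a Type Manager abort them on its own? The following minimal
sketch encodes only that rule; the function and enum names are hypothetical
and do not come from the Zeus design.

```python
from enum import Enum


class VersionState(Enum):
    UNCOMMITTED = "uncommitted"
    COMMIT_PENDING = "commit-pending"


def state_after_nested_completion(protocol):
    """State of the versions a completed nested transaction leaves behind."""
    if protocol == "one-phase":
        return VersionState.COMMIT_PENDING
    if protocol == "two-phase":
        return VersionState.UNCOMMITTED
    raise ValueError(f"unknown protocol: {protocol}")


def type_manager_may_abort(state):
    """Only uncommitted versions can be aborted unilaterally."""
    return state is VersionState.UNCOMMITTED


one_phase = state_after_nested_completion("one-phase")
two_phase = state_after_nested_completion("two-phase")
```

Because type_manager_may_abort(one_phase) is false, a parent that branches on
the success of a nested child cannot have that decision invalidated later by
an object manager.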

Object  managers  and  process  managers  follow  a  two-phase  locking  protocol 
[ESWA76]  so  that  all  concurrently  executing  transactions  are  serializable. 
All concurrently executing nested transactions with the same parent are also
serializable  according  to  these  rules. 

A process or transaction may establish a recovery point by invoking
ESTABLISH_RECOVERY_POINT (ERP). When ERP is invoked, the Process Manager
saves the current state of the process on stable storage and returns a
recovery point number to the calling process. A process can explicitly roll
back to some previous recovery point by invoking the ROLLBACK function. If no
parameters are given, the calling process rolls back to its last recovery
point. If a recovery point number is supplied, the process rolls back to that
recovery point.

It is possible to establish a recovery point for a parent process when a
sequential transaction commits by using the ERP option with the
END_TRANSACTION command. If the parent process subsequently crashes, it would
be restarted either from this recovery point or from a subsequent recovery
point, avoiding re-execution of a transaction that has already been committed.

The  Process  Manager  establishes  the  initial  state  of  every  process  as  the 
recovery  point  numbered  0.  All  subsequent  calls  to  ERP  return  sequentially 
increasing  integer  numbers.  When  a  process  completes,  all  of  its  recovery 
points  are  discarded.  A  process  can  also  discard  any  of  its  recovery  points 
by invoking DISCARD_RECOVERY_POINT (DRP), with the number of the recovery
point to be discarded.
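
The numbering and rollback rules above can be illustrated with a short
sketch. It is a simplified model under stated assumptions: an in-memory
dictionary stands in for stable storage, the class and method names are
hypothetical, and recovery points later than the rollback target are simply
kept, since the text does not say what happens to them.

```python
import copy


class RecoverableProcess:
    """Hypothetical stand-in for a process and its Process Manager."""

    def __init__(self, initial_state):
        self.state = initial_state
        # Stand-in for stable storage: recovery point number -> saved state.
        # The initial state is recovery point 0.
        self._saved = {0: copy.deepcopy(initial_state)}
        self._next_number = 1

    def establish_recovery_point(self):
        number = self._next_number
        self._saved[number] = copy.deepcopy(self.state)
        self._next_number += 1
        return number

    def rollback(self, number=None):
        # With no argument, roll back to the most recent recovery point.
        if number is None:
            number = max(self._saved)
        self.state = copy.deepcopy(self._saved[number])
        return number

    def discard_recovery_point(self, number):
        del self._saved[number]


proc = RecoverableProcess({"count": 0})
proc.state["count"] = 5
rp1 = proc.establish_recovery_point()
proc.state["count"] = 9
proc.rollback()
count_after_default_rollback = proc.state["count"]
proc.rollback(0)
count_after_rollback_to_zero = proc.state["count"]
```

The first call to establish_recovery_point returns 1 because the initial
state already occupies number 0.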

5.2  Formal  Definitions  of  the  Designs 

A major part of the detailed design of the Zeus operating system, including
the design of the Process/Transaction Manager and the Generic Object
Manager, has been done using the Concurrent System Definition Language (CSDL).
These designs are presented in Volume II of the guidebook.

CSDL is intended for designing systems with inherent concurrency (for
example,  geographically  distributed  systems),  systems  in  which  concurrency  is 
needed  to  deliver  adequate  performance,  or  for  which  expressing  the  design  as 
a  collection  of  concurrent  modules  leads  to  a  simpler,  more  understandable 
design. 

There  are  two  basic  concurrent  architectures:  the  static  architecture  in 
which the system is created with a fixed number of modules which persist
throughout  its  lifetime,  and  the  dynamic  architecture  in  which  modules  are 
created  as  needed  to  handle  new  tasks.  CSDL  supports  them  both.  Following 
are  the  salient  features  of  the  CSDL  methodology: 

1.  A  formal  model  of  sequential  and  concurrent  computations. 

2.  A  system  model  that  characterizes  the  building  blocks  with  which  systems 
may be designed.

3.  Methodological  principles  and  guidelines  that  define  desirable  properties 
of  the  design  activity,  the  design  language  and  the  design  itself,  and 
make  procedural  suggestions  for  carrying  out  the  design  process. 

4.  Technical  methods  essential  for  engineering  software.  They  are,  for 
example, data abstraction, procedural abstraction, Dijkstra's constructive
approach,  and  the  like. 

5. A description language - a formal notation for describing how a system is
built up from pieces and how those pieces are connected. Its semantics
are based on the model of computation.

6. A specification language - a formal notation for documenting the expected
behavior of a system description. Its semantics are based on the formal
model.

7.  Analytic  methods  for  investigating  operational  properties  such  as 
performance,  reliability,  or  security  of  alternative  functionally  correct 
system  designs. 

These  elements  are  applied  to  detailed  design,  the  development  phase 
whose  work  product  is  a  design  documenting  a  system's  logical  architecture, 
its  paths  of  information  flow,  the  data  type  of  each  system  object  and  the 
behavior  of  the  system  and  each  of  its  modules.  A  detailed  design  expresses 
what will actually be implemented. Each object in the design (module, data
object, procedure, or information flow path) will exist in the
implementation,  though  the  object's  physical  realization  may  be  different  from 
its  logical  design.  For  example,  a  type  operation  designed  as  a  procedure  may 
be  implemented  with  in-line  code. 

In  CSDL,  the  basic  locus  of  control  is  the  machine.  The  machine  is  a 
container of objects and a control procedure in execution. A machine may
contain  data  objects  of  any  type.  A  machine  may  also  contain  machine-objects, 
that  is,  other  machines  in  operation,  and  pools  of  machine-objects  from  which 
operating  machines  may  be  created  and  destroyed.  These  structures  (the 
machine-object  and  the  pool  of  machine-objects)  enable  a  single  machine  to 
contain  several  concurrently  operating  local  loci  of  control. 

A  machine  definition  consists  of  a  list  of  the  machine's  public  objects 
and  specifications  of  the  machine's  externally  visible  behavior.  Public 
objects  are  those  (active  and  passive)  machine  objects  which  define  the 
external  view  of  the  machine.  A  machine's  realization  is  guaranteed  to  have 
these  objects.  A  machine  communicates  with  its  environment  through  its  active 
public  objects.  Its  passive  public  objects  are  visible  to  the  environment, 
but  cannot  be  manipulated  by  it.  Public  objects  are  used  in  specifications  of 
the  machine's  externally  visible  behavior.  Machine  specifications  may  specify 
initial  values,  invariant  properties  and  machine  behavior. 

A  concurrent  system  is  a  collection  of  machines  which  operate 
concurrently  and  autonomously.  They  communicate  asynchronously  by  passing 
information.  Internally,  a  machine  consists  of  data  objects  and  procedures 
and/or  subordinate  machines  to  manipulate  these  objects.  A  machine  containing 
only  procedures  constitutes  a  sequential  locus  of  control.  A  machine 
containing  subordinate  machines  constitutes  several  autonomous  control  sites. 
If  the  system's  architecture  is  viewed  as  a  tree,  its  leaves  are  all 
sequential  control  sites. 

Machines  may  also  contain  machine  pools  from  which  machine  instances  may 
be  created  and  destroyed  as  the  system  runs. 

Systems are evolutionary. The initial system configuration is described
by  a  distinguished  machine,  SYSTEM.  SYSTEM  may  contain  other  machines  and 
machine  pools.  Each  machine  that  SYSTEM  contains  may,  in  turn,  contain  other 
machines  and  machine  pools.  The  initial  system  is,  then,  the  configuration 
consisting  of  SYSTEM,  all  machines  it  contains,  and  all  the  machines  they 
contain.  A  system  evolves  by  dynamic  creation  and  destruction  of  machines 
from  pools.  Since  every  pool  element  may  contain  machines  and  machine  pools, 
creating  a  new  machine  dynamically  may,  in  effect,  create  a  new  subsystem. 

A  system's  communication  architecture  is  the  set  of  connections  among  its 
machines.  Connections  are  formed  among  active  objects,  objects  whose  values 
can  change  without  being  manipulated  by  the  machine  which  contains  them. 
Since  machines  cannot  manipulate  each  other's  objects,  a  communication  link  is 
set  up  by  connecting  an  active  object  in  one  machine  to  a  complementary 
(roughly  same  type,  opposite  direction)  active  object  in  another.  The  sending 
machine  puts  a  value  in  its  local  active  object,  and  that  value  is 
instantaneously  transmitted  to  the  complementary  active  object  from  which  the 
other  machine  can  get  it  by  a  local  operation.  Active  objects  may  be 
connected  to  realize  point-to-point,  multi-cast,  fan-in  and  broadcast 
communication  architectures.  Connected  active  objects  by  definition 
correspond  to  shared  objects  in  the  computational  model. 
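
A communication link of this kind might be pictured as a pair of connected
ports. The sketch below is not CSDL; it is a Python illustration in which the
names OutPort, InPort, connect, put, and get are hypothetical, and a queue is
assumed at the receiving side so that values put before the receiver reads
them are not lost.

```python
from collections import deque


class InPort:
    """Receiving active object: its value changes without local action."""

    def __init__(self):
        self._values = deque()

    def deliver(self, value):
        # Called by the connection, not by the containing machine.
        self._values.append(value)

    def get(self):
        # Local operation performed by the containing machine.
        return self._values.popleft()


class OutPort:
    """Sending active object, connectable to complementary InPorts."""

    def __init__(self):
        self._targets = []

    def connect(self, in_port):
        self._targets.append(in_port)

    def put(self, value):
        # Local operation; the value appears at every connected InPort.
        for target in self._targets:
            target.deliver(value)


# Multi-cast: one sender connected to two receivers.
sender = OutPort()
receiver_a, receiver_b = InPort(), InPort()
sender.connect(receiver_a)
sender.connect(receiver_b)
sender.put("status")
received = (receiver_a.get(), receiver_b.get())
```

Connecting one OutPort to a single InPort gives point-to-point communication,
and connecting several OutPorts to the same InPort gives fan-in, so the same
two classes cover the architectures named in the text.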

5.3  Summary 

The  object  oriented  design  model  presented  in  the  previous  chapter  is 
used  for  designing  Zeus,  a  distributed  operating  system  for  reliable 
applications.  A  Zeus  system  is  essentially  a  collection  of  Type  Managers; 
each  Type  Manager  is  responsible  for  managing  the  objects  of  its  associated 
type.  A  set  of  system-defined  Type  Managers  provides  certain  primitives  for 
building  reliable  application  systems.  An  atomic  action  facility  is  the  basic 
mechanism in Zeus for building reliable applications. An atomic action in
Zeus, called a transaction, can span several distributed sites in the
system.  The  purpose  of  the  Zeus  design  is  to  illustrate  certain  design 
principles  for  building  reliable  distributed  systems;  it  is  not  intended  as  a 
point  solution  but  rather  a  framework  for  system  designs.  In  designing  a 
system,  the  designer  has  to  go  through  several  steps  starting  with  its 
conceptual  design  to  the  implementation.  A  formal  notation  which  supports 
expression  of  the  system  definition  in  a  clear  and  systematic  fashion  is  the 
single  most  important  tool  for  the  designer.  Concurrent  System  Definition 
Language (CSDL) is intended to serve as a formal design notation. A major
part  of  the  Zeus  design  was  defined  using  CSDL. 


CHAPTER  6 


ANALYSIS  AND  VALIDATION  TECHNIQUES 


6.1  Introduction 


Reliability, timeliness and correctness of system functions are the most
critical attributes of a command and control system. A major part of the
system designer's guidebook is devoted to the techniques and tools for
analyzing these properties of reliable distributed system designs. Recent
approaches to system design advocate that the design analysis activities
should proceed concurrently with the design activity in a tightly coupled
fashion; each design step needs to be validated to ensure that the design
decisions would lead to the desired performance and reliability goals. This
chapter presents a brief overview of the major accomplishments towards this
goal. The presentation here is divided into three major sections. The first
section describes the techniques for modeling fault-tolerant systems using PAWS
for performance evaluation. The second section deals with the reliability
analysis techniques, and the third section is devoted to the techniques for
proving or validating the correctness of recovery mechanisms in distributed
systems.

6.2  Performance  Evaluation  Of  Recovery  Mechanisms 


The first concern of a system designer is generally the correct
functionality of the system he is designing. It is, of course, imperative
that a system correctly performs the tasks for which it is intended. Until
very recently, designers did not concern themselves with the costs, in terms
of resources and time, of providing this functionality until after some or all
of the system was operational. Early performance predictions and the
resulting design iterations are especially important in the design of highly
reliable systems. This is because, unlike functional correctness, the
reliability of certain functions or modules might be negotiable. If the cost
of a reliable function is too high, the designer might be willing to accept a
lower degree of reliability for that function, one which is not so extravagant
with system resources. Such tradeoff decisions can only be made if the
designer has at his disposal early estimates of performance and reliability.


The performance analysis part of the design evaluation phase is concerned
with providing quantitative estimates for certain resource utilization and
performance measures. Exactly which measures are interesting to the system
designer and at what level of detail these estimates are to be made are
questions for which it is important to have answers before proceeding with the
analysis effort.

6.2.1  Performance  Measures 

There are several generic performance measures which are typically used
to describe and quantify the performance of computer systems. These fall into
two distinct categories: user-oriented measures and system-oriented
measures.

The user-oriented measure most often used with respect to interactive
systems is response time (turnaround time for batch systems). Response time
is the elapsed time between the arrival of a request and the completion of
that request by the system. Of course the exact moments of "arrival" and
"completion" of a request must be carefully defined for any given application.

The two system-oriented measures most commonly encountered are throughput
and utilization. Throughput is defined as the average number of requests
processed by the system per unit of time. This is typically not a very useful
measure of system performance since, as long as the system is performing well
enough so that it can complete requests without creating an ever-increasing
backlog, the throughput of the system is equivalent to the average arrival
rate of the requests. Utilization is defined to be the fraction of time that
a particular resource is busy - that is, working on some request.
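
As a worked illustration of these three definitions, the sketch below
computes them from a small request log. The log format (arrival time, service
start time, completion time), the single-resource assumption, and the sample
values are all assumptions made for this example.

```python
def performance_measures(requests, observation_time):
    """requests: list of (arrival, service_start, completion) times.

    Assumes a single resource that serves one request at a time.
    """
    response_times = [done - arrived for arrived, _, done in requests]
    busy_time = sum(done - started for _, started, done in requests)
    return {
        "mean_response_time": sum(response_times) / len(response_times),
        "throughput": len(requests) / observation_time,
        "utilization": busy_time / observation_time,
    }


# Four requests observed over 10 time units (made-up values).
log = [(0.0, 0.0, 2.0), (1.0, 2.0, 3.0), (4.0, 4.0, 5.0), (8.0, 8.0, 10.0)]
measures = performance_measures(log, observation_time=10.0)
```

For this log the mean response time is 1.75 time units, the throughput is 0.4
requests per time unit, and the utilization is 0.6. The second request waits
one time unit for the resource, which is why its response time exceeds its
service time.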


6.2.2  Models  and  Hierarchical  Structuring 

For operational systems, the most straightforward approach to performance
evaluation is to directly measure the performance using some combination of
hardware and software monitors. This, of course, is impossible during the
design phases of a system since there is nothing yet to measure. In such
cases when direct measurement is impractical or impossible, a model of the
system must be devised which captures the salient factors that determine
system performance. The model is then evaluated and the performance measures
thus obtained are used as estimates for the performance measures of the actual
system.

The complexity of such models and the degree to which they represent or
abstract from the actual system determine to a large extent the amount of
effort and expense required to evaluate them. Generally, the more detailed
the model, the more expensive it is to evaluate. Luckily, during the design
phases of a system, there is normally not a requirement for extremely accurate
estimates of system performance. We are typically more interested in
rejecting those designs which have a very negative impact on performance and
in providing guidance as to which parts of the design should be considered for
optimization. Therefore, performance models constructed during the design
stage are normally simpler and more abstract relative to the actual system.

Even so, modeling a system which has many interconnected parts, even at a
very abstract level, often produces overall models which are large, complex,
and for which evaluation is intractable. The solution to this problem is the
same as the classical solution to the general problem of software complexity:
hierarchical structuring. Models which are decomposed into several smaller
sections and structured vertically or hierarchically as in Figure 6-1 prove to
be both more manageable and easier to evaluate [KOBA78] [BROH75]. Such
hierarchical structuring allows the analyst to summarize the performance
results obtained from evaluating one level of the model (say the micro level
in Figure 6-1) in a form which is easily usable in the next higher level (the
intermediate level in Figure 6-1).

The decomposition of a model into a hierarchy of sub-models should take
into account the inherent structures of the machine configuration as well as
the system being modeled. A common rule-of-thumb criterion [KOBA78] is that
the time constant at a given level of the model should be significantly
smaller than the average inter-event times at the next higher level. In other
words, a large number of state change events should occur at the lower level
between events at the next higher level. In Figure 6-1, for example, the
micro level models might have typical inter-event times on the order of
micro-seconds or nano-seconds, the intermediate level on the order of
milli-seconds, and the macro level on the order of seconds.

For object-oriented, high-integrity systems, there are a number of
convenient levels of detail for which performance measures may be obtained.
This natural hierarchy of modeling layers is illustrated in Figure 6-2.


Figure 6-1. Hierarchical Model Structure.

Figure 6-2. Object-Oriented Modeling Hierarchy.

6.2.3 Parts of a Performance Model: System, Environment, Workload

There  are  really  three  distinct  factors  which  impact  the  development  of 
performance  models  for  computer  systems.  The  most  obvious  of  these  is,  of 
course,  the  structure  of  the  system  which  is  to  be  modeled.  As  previously 
mentioned,  the  structure  of  the  model  might  not  operationally  reflect  the 
structure of the actual system but might rather abstract from it the main
features  which  affect  performance.  It  is  believed,  however,  that  the  object- 
oriented  system  structure  discussed  widely  in  this  guidebook  will  simplify  the 
task  of  producing  performance  models  of  the  system.  This  is  because  of  some 
of  the  same  reasons  that  make  this  approach  highly  suitable  for  the 
formulation  of  highly  reliable  distributed  systems:  inherent  modularity  and 
hierarchical  structuring. 

In addition to modeling the structure of the system, the environment in
which  the  system  must  operate  must  also  be  considered.  The  environment 
includes  such  things  as  the  native  hardware  and  software  in  which  the  system 
is  to  be  embedded  and,  of  particular  interest  in  the  modeling  of  reliable 
systems,  the  fault  characteristics  of  that  hardware/software  configuration. 

Finally,  the  workload  which  the  system  will  be  expected  to  accommodate 
must  also  be  modeled  in  some  way.  Choosing  an  appropriate  workload  and  a 
representation  for  it  is  less  of  a  problem  for  existing  operational  systems 
although  it  is  still  very  much  an  art  and  still  very  difficult  to  do.  For 
systems which are not yet operational, the problem becomes one of choosing or
inventing a hypothetical workload which will hopefully reflect the
characteristics  of  the  future  workload  of  the  actual  system. 

In  order  to  obtain  useful  predictions  of  performance  measures  from  the 
models,  they  must  be  evaluated  in  some  way.  Performance  model  evaluation  is 
the  process  by  which  values  are  derived  for  the  chosen  performance  indices 
given  a  "correct"  and  properly  parameterized  model  and  an  appropriate 
workload.  Once  a  validated  model  has  been  constructed,  it  must  be  evaluated 
to  obtain  values  for  the  performance  indices  of  interest.  Models  may  be 
evaluated  analytically,  by  simulation  techniques,  or  by  some  hybrid 
combination  of  the  two. 

6.2.3.1 Analytic Methods

In [KOBA78], an analytic evaluation method is defined as "a solution
technique that allows us to write a functional relation between system
parameters and a chosen performance criterion in terms of equations that are
analytically solvable." The term "analytically solvable" here is usually
taken to include numerical solution methods other than simulation as well as
closed-form solutions. Although such a definition of analytic solvability
includes deterministic techniques like automata theory and Petri nets, the term
is most often used to refer to the mathematical discipline called queueing
theory. Mathematical queueing theory provides a framework in which networks
of resources (CPU, memory, I/O devices, etc.) are being prevailed upon by
jobs to perform some services. Contention for a resource causes jobs to be
queued for later service.
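
A simple instance of such a functional relation is the single-server M/M/1
queue, the textbook starting point of queueing theory. The choice of M/M/1
(Poisson arrivals, exponential service times, one server) is an assumption
made here for illustration; the report does not say which queueing models
were used.

```python
def mm1_measures(arrival_rate, service_rate):
    """Steady-state measures of an M/M/1 queue.

    Returns (utilization, mean response time, mean number of jobs present).
    """
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    utilization = arrival_rate / service_rate
    mean_response_time = 1.0 / (service_rate - arrival_rate)
    mean_jobs_in_system = utilization / (1.0 - utilization)
    return utilization, mean_response_time, mean_jobs_in_system


# 8 jobs per second arriving at a resource that can serve 10 per second.
rho, response, jobs = mm1_measures(arrival_rate=8.0, service_rate=10.0)
```

With these rates the resource is 80 percent utilized, the mean response time
is 0.5 seconds, and about 4 jobs are present on average. The closed-form
expressions show how response time grows without bound as the arrival rate
approaches the service rate, which is the contention effect described above.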

6.2.3.2 Simulation Methods

When evaluating a hierarchically structured model of a large system, it
is likely that at least some of the submodels will be susceptible to
analytical methods. It is also very likely, however, that the analytical
solution of some of the submodels will remain mathematically intractable even
with simplifying assumptions and constraints. In these cases, the only
alternative evaluation method for non-existing systems is simulation.

Simulation  is  a  numerical  technique  for  evaluating  queueing  network 
models  by  mimicking  the  dynamic  behavior  of  the  system  being  modeled.  The 
principal advantage of this technique is its great generality. Most of the
constraints  which  are  necessary  for  analytical  methods  have  little  consequence 
with  respect  to  simulation.  The  three  main  problems  with  simulation  are  the 
expense  involved  with  building  the  simulator,  the  expense  of  running  the 
simulation,  and  the  necessity  for  statistical  analysis  of  the  resulting  output 
data. 


The  problem  of  the  expense  of  running  simulations  derives  from  the  fact 
that  the  length  of  a  simulation  run  is  proportional  to  the  number  of  events 
which  must  be  simulated  rather  than  to  the  duration  of  simulated  time.  In  the 
example  of  Figure  6-1,  it  would  probably  be  desirable  to  run  a  simulation  of 
such  a  system  long  enough  to  see  perhaps  hundreds  of  events  at  the  macro-level 
in  order  to  ensure  that  the  simulation  reaches  a  steady  state.  Since  events 
at  this  level  occur  approximately  every  second,  we  will  wish  to  run  the 
simulation  for  something  on  the  order  of  say  1000  seconds.  But  if  the 
intermediate  and  micro  levels  are  also  entirely  included  in  the  model,  the 
total  number  of  micro  events  which  must  be  simulated  might  be  on  the  order  of 
several  billion.  Such  a  simulation  run  will  likely  be  very  expensive.  This 
problem  is  most  effectively  controlled  by  hierarchical  structuring.  This 
allows  low-level  models  to  be  evaluated  separately  and  the  results  summarized 
at  the  next  higher  level  in  the  form  of  a  scaling  factor  or  statistical 
distribution. 
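
The arithmetic behind this estimate can be made explicit. The sketch below
assumes mean inter-event times of one second, one millisecond, and one
microsecond for the macro, intermediate, and micro levels, following the
orders of magnitude given for Figure 6-1; the exact values are illustrative.

```python
def events_to_simulate(simulated_seconds, mean_inter_event_seconds):
    """Approximate number of events at one level of the hierarchy."""
    return round(simulated_seconds / mean_inter_event_seconds)


RUN_LENGTH = 1000.0  # seconds of simulated time, as in the text
levels = {"macro": 1.0, "intermediate": 1e-3, "micro": 1e-6}
event_counts = {
    name: events_to_simulate(RUN_LENGTH, gap) for name, gap in levels.items()
}
```

Under these assumptions the macro level contributes about a thousand events,
the intermediate level about a million, and the micro level about a billion.
With nano-second inter-event times at the micro level the count would be a
thousand times larger again, which is why the lower levels are evaluated
separately and only their summarized results are carried upward.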

6.2.3.3 Hybrid Methods

A  combination  of  both  analytical  and  simulation  methods  may  be  used  in 
evaluating  a  model  of  a  large  system.  Again,  the  hierarchical  nature  of  the 
model  may  be  taken  advantage  of  to  allow  lower-level  sub-models  to  be 
evaluated using either analysis or simulation, whichever is more appropriate
and  least  expensive.  The  results  thus  obtained  may  then  be  summarized  in 
modeling  the  higher  layers. 

6.2.4  Performance  Measures  for  Recovery  Mechanisms 

The design tradeoff decisions concerning reliability and integrity
mechanisms and performance are generally more complex than those for
conventional systems where high reliability is of less importance. In
conventional systems, tradeoffs are typically between different kinds of
performance, such as resource utilization and response time. The only other
analytical property of such systems is their correctness - the degree to which
they satisfy their operational specifications. For obvious reasons, the
correctness of a program is rarely purposely compromised in favor of better
performance. It may, however, be perfectly valid to design an object so that
it is slightly less reliable but responds more quickly (or vice versa).

Because of the complex tradeoff decisions which are likely to be involved
in configuring a system such as the one with which we are currently concerned,
it will be necessary for the designers (and possibly also the system
administrators) to have at their disposal reasonably accurate estimates of the
costs of the various reliability and integrity mechanisms which are provided
by the system. In order to provide such estimates, the analyst/designer must
examine at least three different cases:

o The performance of the system in the absence of the relevant
reliability/integrity mechanisms.

o The performance of the system with the relevant mechanisms in place but in
the absence of the failures which the mechanisms are there to protect
against. This class of performance figures, when compared to those
obtained as above, will provide a useful estimate of the best case cost of
providing protection from faults.

o The performance of the system with integrity mechanisms in place and when
failures of the defined class actually occur. Together with the results
obtained in the first two cases above, these figures will provide an
estimate of the time and resource requirements of the recovery mechanisms
of the system.
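
One simple way to combine the three cases is to take differences between
them. The sketch below does this for a single measure, mean response time;
the function name and the sample figures are invented for illustration and
are not measurements from the Zeus models.

```python
def mechanism_costs(baseline, protected_no_failure, protected_with_failure):
    """Each argument is a mean response time for the same workload.

    Returns (best-case cost of protection, additional cost of recovery).
    """
    best_case_cost = protected_no_failure - baseline
    recovery_cost = protected_with_failure - protected_no_failure
    return best_case_cost, recovery_cost


# Made-up mean response times in milliseconds for the three cases.
best_case, recovery = mechanism_costs(
    baseline=40.0, protected_no_failure=46.0, protected_with_failure=70.0
)
```

Here the mechanisms cost 6 ms per request when no failure occurs, and
recovery adds a further 24 ms when a failure of the defined class does occur.
The same subtraction can be applied to resource utilization or any other
measure obtained for all three cases.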

6.2.5  Example  Metrics  for  Some  Generic  Integrity  Mechanisms 

The  following  is  a  sampling  of  some  of  the  performance  measures  which  are 
likely  to  be  interesting  in  a  distributed,  object-oriented  reliable  system. 
Three generic classes of reliability and integrity mechanisms are used to
illustrate the issues involved: transactions, concurrency control, and object
replication.  A  more  detailed  discussion  of  these  mechanisms  may  be  found  in 
Chapter  4. 

6.2.5.1 Transaction Mechanisms

Of  course,  user-level  scenarios  will  probably  be  defined  as  or  in  terms 
of  atomic  transactions.  The  response  times  and  throughput  of  these  will  be  of 
primary  concern.  In  this  section,  however,  we  will  be  dealing  only  with  the 
low  level  performance  characteristics  of  the  mechanisms  used  to  attain 
atomicity  and  reliability  for  groups  of  associated  individual  type  manager 
operations. The following is a list of some of these low level
characteristics:

o Mean Rollback Time - The mean time required for a type manager to roll back
an object to a previous state (the state of the object at the time of the
last checkpoint).

o Mean Size of "Window of Vulnerability" - The mean time during which an
object is vulnerable to a failure of the coordinator of the transaction.

o Mean User In-Doubt Period - The mean time from when a user decides to
commit a transaction until the user can be told that the results are
committed.

o Mean Coordinator In-Doubt Period - The mean time (from when a coordinator
issues a commit message until it receives acknowledgments from object
managers, e.g., second phase) during which a coordinator must retain the
state of a transaction.

6.2.5.2 Object Replication

There are both costs and benefits associated with maintaining multiple
redundant copies of some objects. The costs are in the form of the additional
storage requirements for the redundant copies and the time and resources
required to ensure that the multiple copies remain consistent. The
performance benefit stems from the fact that, in some cases, local copies of
an object may be used to provide read-only access, thus eliminating the
communication costs of accessing a remote copy instead.

o Redundant Storage Overhead - The additional storage and other resources
required to maintain all but one of the identical copies of an object.

o Multiple Update Overhead - The additional time and resources required to
update additional copies of a replicated object.

o Read-Only Access Improvement - The average improvement in read-only type
operations due to the distribution and replication of an object.
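
A rough, first-order way to quantify these three metrics is sketched below.
The formulas are simplifying assumptions made for this example, not
definitions from the report: every copy is taken to be the same size, the
additional copies are assumed to be updated one after another at equal cost,
and a fixed fraction of reads is assumed to be satisfied by a local copy.

```python
def replication_costs(object_size, copies, update_time_per_copy):
    """Returns (redundant storage, extra update time) for a replicated object."""
    extra_copies = copies - 1  # all but one copy count as overhead
    redundant_storage = extra_copies * object_size
    multiple_update_overhead = extra_copies * update_time_per_copy
    return redundant_storage, multiple_update_overhead


def read_only_improvement(remote_read_time, local_read_time, local_fraction):
    """Mean time saved per read when a fraction of reads use a local copy."""
    return local_fraction * (remote_read_time - local_read_time)


# Made-up figures: a 4096-byte object kept in 3 copies, 2.0 ms per update.
storage, update = replication_costs(4096, 3, 2.0)
saved = read_only_improvement(
    remote_read_time=10.0, local_read_time=1.0, local_fraction=0.5
)
```

With these figures the redundant copies cost 8192 bytes and 4.0 ms per
update, while reads improve by 4.5 ms on average. Whether replication pays
off then depends on the ratio of reads to updates in the workload.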


6.2.6  Zeus  Performance  Modeling 

The performance evaluation of Zeus is carried out using PAWS (Performance
Analyst's Workbench System) [IRA83], a general purpose simulation language for
the performance evaluation of system models. Our choice of these particular
tools was partly due to in-house familiarity with them (PAWS is a registered
trademark of Information Research Associates), and partly due to their
suitability for the tasks of representing and evaluating performance models.

The creation of a performance model is a multi-step process involving
first a determination of the most relevant execution pathways in the system
design (i.e., there are many possible execution pathways in any system, and
only a subset of those is used with great frequency). This shifts the focus
to those modules, and the system activity those modules represent, that are
most relevant to the performance of the system. Once the performance
determining pathways are defined, they must be coupled in a meaningful fashion
with a target resource (hardware) configuration and a specification of the
resource usages along the performance pathways.

Execution paths are translated into Information Processing Graphs (IPGs),
which are pictorial constructs for modeling information processing systems.
As given in an introduction to this tool [IRA83], IPGs are a useful modeling
methodology for several reasons: pictures often provide the best method for
describing and understanding information flow; it is easier to communicate
ideas quickly using a picture; and information processing systems are often
designed around a structure of information flow. From these IPGs, it is a
straightforward translation to a queueing network model.

In a distributed operating system, information flows through resources on
hosts and between hosts in the network. The basic graphical components are
nodes, edges, and labels. In an IPG, each node represents a resource (such as
CPU, memory, disk units, etc.), while edges connect nodes and represent some
form of information flow from one resource to another. Edges are given labels
denoting the form of information flow. The IPGs are directly mappable to the
Performance Analyst's Workbench System (PAWS), which is a simulation language
that is used to evaluate performance models. In a model, the information
flows which are of interest are given what are termed category and
transaction names, for which statistics are gathered during the simulation.
Additionally, for each resource in the model a set of summary statistics is
generated.
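
As an illustration only, the node, edge, and label structure of an IPG can be
sketched as a small labeled directed graph. The Python fragment below is not
PAWS input and does not reproduce any model from the guidebook; the resource
names, the flow labels, and the helper outgoing_flows are hypothetical.

```python
from collections import defaultdict

# Hypothetical IPG: nodes are resources, labeled edges are information flows.
nodes = {"cpu", "memory", "disk", "network"}
edges = [
    ("cpu", "memory", "page_request"),
    ("memory", "disk", "page_fault"),
    ("disk", "memory", "page_data"),
    ("cpu", "network", "remote_message"),
]


def outgoing_flows(edge_list):
    """Group (target, label) pairs by the resource the flow leaves."""
    flows = defaultdict(list)
    for source, target, label in edge_list:
        flows[source].append((target, label))
    return dict(flows)


flows = outgoing_flows(edges)
```

A table of this kind is only the structural part of a model; the resource
usages along each flow would still have to be supplied before any evaluation.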

6.2.7  Summary 

In the system designers guidebook we have discussed the tools and
techniques that are available to aid in the performance analysis of
distributed, reliable systems. We began by very briefly surveying the field
of performance analysis of computer systems, especially emphasizing the issues
that are relevant to performance analysis during the design phases. Modeling
and model evaluation techniques are the important topics in this regard. In
addition to the general sketch of performance modeling, we also gave a
somewhat more detailed account of several of the representational and
simulation tools. The specific issues involved with modeling the design of a
particular class of computer systems are discussed in the guidebook. It
should be apparent from the material in this chapter that performance
evaluation during the early stages of design and continuing throughout the
lifetime of the system can be an invaluable strategy for producing viable,
efficient software products.

6.3  Reliability  Analysis  Techniques 

Similar to performance characteristics, the specification and evaluation
of reliability characteristics of a design are an important and integral part
of the design process for reliable systems. A design process typically
consists of several phases, starting with the requirements specifications and
ending with the final design meeting those requirements. These phases may
involve several iterations of design and validation until the design meets the
desired requirements. For reliable systems, the requirements statements must
include the specifications of the desired reliability characteristics of the
target system. Typically a design process consists of decomposing the design
into a set of sub-problems. In such cases the requirements statements, which
include the reliability specifications, are appropriately extended and
augmented for each of the sub-systems. The validation task consists of
verifying that the target system constructed from those sub-systems, with the
given reliability characteristics, has the desired reliability.

The traditional approach to specifying reliability characteristics is to
use certain numerical measures such as availability, mission time, and
mean-time-to-failure (MTTF). Chapter 2 described some discrete measures for
reliability specifications. These measures imply that a system has certain
failure characteristics under a given set of system faults. The measures
capture the level of consistency maintained by the system under this set of
faults.

There are essentially two approaches to reliability analysis of system
designs: simulation and analysis. One approach is to simulate the system
design along with its failure environment and the recovery mechanisms. This
approach is inherently expensive because it requires building simulation
models specific to the design to be analyzed. It provides relatively more
accurate results than the second approach because it captures the structure
and functioning of the system in greater detail. The second approach is
based on combinatorial analysis of the system, using the reliability
characteristics of its components and their interconnections. The
reliability characteristics of the components are specified in terms of
availability and MTTF. This approach is, in general, faster and less
expensive.

The combinatorial analysis methods provide quick first-order evaluations
of the system reliability characteristics given the system configuration and
the reliability characteristics of its components. These methods can be used
to construct a general purpose evaluation tool. One such tool, called NetRAT
(Network Reliability Analysis Tool), is described in this chapter. The
evaluations using this method are somewhat less accurate than those using
simulation models because they do not capture some of the dynamic operating
conditions such as execution delays, system load, and resource contention.
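
The following is a minimal sketch of the kind of first-order combinatorial
evaluation described above. It assumes that component failures are
statistically independent and that each availability is a steady-state
probability. The functions series and parallel and the numerical values are
illustrative assumptions; they are not taken from NetRAT.

```python
from math import prod


def series(availabilities):
    """All components must be up for the combination to be up."""
    return prod(availabilities)


def parallel(availabilities):
    """The combination is up if at least one component is up."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical configuration: one link (0.99) in front of two
# redundant servers (0.9 each), either of which can do the work.
system_availability = series([0.99, parallel([0.9, 0.9])])
```

As the text notes, a figure obtained this way ignores dynamic effects such as
load and contention, so it is best read as a quick first estimate.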

6.3.1  Specifications  of  Reliability  Measures 

Traditionally the reliability characteristics of a system are expressed
in terms of certain probabilistic measures such as availability,
reliability, mean-time-to-failure (MTTF), mission time, and, for repairable
systems, mean-time-to-repair (MTTR).

For a large system, such as a distributed command and control system, it is
more realistic to specify individually the reliability characteristics of
its services and virtualized resources as seen by the system users, rather
than the availability and MTTF of the entire system. The approach that we
follow consists of specifying the reliability characteristics of the
functions executed on the system objects. These characteristics for a
function will be different for its invocations from different nodes. For
example, a system service might be available for a higher percentage of the
time when accessed from one node, whereas the same service might be
available for only 90% of the time when accessed from some other node.

The availability A(t) of a system is a function of time indicating the
probability that the system is functioning correctly at any given time t. In
distributed systems, we are interested in computing the availability of the
functions (services), which expresses the probability of that function
(service) being available at any random instant of time. A function execution
at a node requires access to some resources which are distributed in the
network. A successful execution of the function requires that the resources
be accessible from the node where the function is being executed. Therefore,
the availability of a function is dependent on the availability of (1) the
communication paths to the required resources, (2) the nodes holding the
resources, and (3) the nodes executing the function.


The mean-time-to-failure (MTTF) for a service in a distributed system is
the expected time interval during which that service remains available before
a failure occurs. A service fails if it is unable to access any of the
required resources or if the node executing the service fails. MTTF is an
important measure of reliability in distributed systems because of the
possibility of large delays encountered in communication.
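
To make the relationship among these measures concrete, the sketch below uses
the common textbook model of constant failure and repair rates
(exponentially distributed times to failure and to repair). This model is an
assumption introduced here for illustration; the report does not state it.
Mission time is taken, also as an assumption, to be the longest interval over
which reliability stays at or above a required level. The function names are
hypothetical.

```python
import math


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time a repairable component is up."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of no failure in [0, t] under a constant failure rate."""
    return math.exp(-t / mttf)


def mission_time(mttf, required_reliability):
    """Longest t for which reliability(t, mttf) >= required_reliability."""
    return -mttf * math.log(required_reliability)


# Hypothetical component: fails on average every 900 hours,
# and takes on average 100 hours to repair.
availability = steady_state_availability(900.0, 100.0)
```

Under other failure distributions the same definitions apply, but the closed
forms for reliability and mission_time would differ.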

Using the numerical reliability measures for requirements specifications
raises certain problems. One of the problems is dealing with small numbers in
specifying these measures. Another problem is related to the fact that the
reliability measures of the system components are significantly altered
under combat conditions. Under such circumstances the reliability analysis
techniques should focus on determining whether the system performs correctly
and in a timely fashion if a certain set of resources is unavailable. This
leads to specifying a discrete set of reliability levels corresponding to the
consistency levels maintained by the system under various fault conditions
within the system. The system designers guidebook presents four discrete
reliability classes for objects.

6.3.2  Network-Based  Reliability  Model 

This section describes a network-based approach for representing a system
to evaluate its reliability. This model is ideally suited to representing
distributed system architectures. In the past a considerable amount of work
has  been  done  in  the  evaluation  of  reliability  and  availability  of  paths  in 
network-based  systems,  particularly  in  the  area  of  communication  networks. 
Most  of  this  work  addresses  the  problem  of  pair-wise  terminal  reliability  in 
communication  networks,  i.e.,  given  a  pair  of  nodes  in  the  system,  determine 
the  availability  of  the  communication  path  between  these  two  nodes. 

In distributed systems, an important generalization of the pair-wise
terminal reliability problem considers the availability of paths from a set of
nodes in the network to a different set of nodes. For example, a service
execution in a network might require access to several resources that are
located at different nodes. It is also possible for a service to require
access to any one of the several resources distributed in the network. For
example, a read operation on a replicated file can be successfully performed
if the node executing this operation can reach any one copy of the file. This
is referred to as the multi-terminal reliability problem and has been
addressed in a recent work [GRNA81]. In [GRNA81] an algorithm is presented
which computes the multi-terminal availability from the availability of the
network components.

NetRAT is a reliability analysis tool for network-based systems, which
facilitates the evaluation of multi-terminal reliability characteristics. The
NetRAT system is essentially based on the algorithms described in [GRNA81]
and [GRNA80]. However, the algorithm presented in [GRNA80] is incorrect. We
have corrected this algorithm in [WANG83] and incorporated it into NetRAT. In
addition to the availability calculation, NetRAT also permits the evaluation
of other reliability measures, such as the reliability function,
mean-time-to-failure (MTTF), and mission time. These extensions are described
in the next section.
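
The multi-terminal question posed above (can a node reach any one copy of a
replicated resource?) can be illustrated by brute-force enumeration of
component states. The sketch below is not the algorithm of [GRNA81], nor the
corrected algorithm incorporated into NetRAT; it simply sums the
probabilities of all states in which the source reaches at least one target.
It assumes independent failures and bidirectional links, and its cost grows
exponentially with the number of components, so it is usable only on very
small networks. All names and numbers are hypothetical.

```python
from itertools import product


def reachable(source, up_nodes, up_links):
    """Nodes reachable from source using only working nodes and links."""
    if source not in up_nodes:
        return set()
    seen, stack = {source}, [source]
    while stack:
        here = stack.pop()
        for a, b in up_links:
            for u, v in ((a, b), (b, a)):  # links are bidirectional here
                if u == here and v in up_nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return seen


def multi_terminal_availability(node_avail, link_avail, source, targets):
    """P(source reaches at least one target), by enumerating all states."""
    components = list(node_avail.items()) + list(link_avail.items())
    total = 0.0
    for state in product([True, False], repeat=len(components)):
        prob = 1.0
        up = set()
        for (name, avail), is_up in zip(components, state):
            prob *= avail if is_up else 1.0 - avail
            if is_up:
                up.add(name)
        up_nodes = {n for n in up if n in node_avail}
        up_links = {l for l in up if l in link_avail}
        if reachable(source, up_nodes, up_links) & set(targets):
            total += prob
    return total


# Hypothetical network: node 1 reads a file replicated at nodes 2 and 3.
node_avail = {1: 1.0, 2: 0.9, 3: 0.9}
link_avail = {(1, 2): 1.0, (1, 3): 1.0}
read_availability = multi_terminal_availability(node_avail, link_avail, 1, {2, 3})
```

In this small case the read fails only when both copies are down, so the
result is 0.99 up to floating-point rounding.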

The reliability analysis model underlying the NetRAT system is
network-based, and the evaluation procedures are combinatorial. In the
network-based model, a system is represented as an interconnection of nodes.
The nodes represent the functional units; and the links, which can be either
directional or bidirectional, represent the communication paths. Reliability
measures such as availability, reliability, MTTF, etc., are associated with
these components. In the NetRAT model, a set of functions and resources are
assigned to these nodes. Each function requires access to some resources,
which can be physical resources or other functions. Functions in the NetRAT
model correspond to activities which provide services in real systems; and
physical resources in the NetRAT model correspond to data and hardware
resources in real systems, such as processors, memory, disks, I/O devices,
files, etc.

A  node  may  contain  more  than  one  resource  or  service,  and  multiple  copies 
of  a  resource  may  exist  at  several  different  nodes.  In  case  of  multiple 
copies  of  a  resource,  any  one  of  these  copies  can  be  used  to  meet  the  resource 
requirements  of  a  function.  A  function  may  be  available  at  several  different 
nodes.  The  resource  requirements  of  a  function  can  be  combinatorial;  for 
example,  a  function  may  require  resources  (A  and  B  and  C)  or  (B  and  D). 
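
A combinatorial requirement such as (A and B and C) or (B and D) can be
written as a list of alternative resource sets. The sketch below is one
possible encoding, not NetRAT's input format; the names requirement and
satisfied are hypothetical.

```python
# (A and B and C) or (B and D): each inner set is one acceptable alternative.
requirement = [{"A", "B", "C"}, {"B", "D"}]


def satisfied(requirement, available):
    """True if every resource of at least one alternative is available."""
    return any(alternative <= available for alternative in requirement)


# With only B and D reachable, the second alternative is met.
can_run = satisfied(requirement, {"B", "D"})
```

Because any copy of a resource meets the requirement, the set of available
resources here would be the resources with at least one reachable copy.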

We illustrate this network-based model using a set of examples. Consider
the network shown in Figure 6-3. This network model consists of four nodes:
1, 2, 3, and 4. The availability data of these nodes and the interconnecting
links are shown in the figure. A function (program) called FUN executes at
node 1. This function requires access to resources R1 and R3. Resource R1 is
located at two nodes, 2 and 4, and resource R3 is located at nodes 3 and 4.
In this example, we are interested in computing the availability of function
FUN. In the model shown in Figure 6-3, if a node is available (functioning
correctly), then all the resources located at that node are available.
Consider another scenario in which a node may be available, but the resources
located at it may not all be available. At a given node, the availability of
a local resource could be less than 1.0. For example, in the system of Figure
6-3, resource R2 at node 3 is available with probability 0.7 and R3 with
probability 0.8. In order to represent this system in the NetRAT model, the
network model in Figure 6-3 is changed to that in Figure 6-4. Here resources
R2 and R3 are represented as separate nodes (shown as nodes 5 and 6) connected
to node 3.

As mentioned earlier, it is possible to include in the resource
requirements of the function FUN some other function names. The actual
physical resources required for FUN include the union of the resources
required by each of the functions whose names appear in the resource
requirement of FUN. For example, in a modified scenario the function FUN
requires resources R1, R2 and FIND, where FIND is a function that requires the
resource R3. Hence the total resource requirement for FUN consists of R1, R2
and R3. Recursive references to function names in the resource requirements
are permitted, as long as they can be resolved in terms of physical resource
requirements.
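
The resolution of function names into physical resources can be sketched as a
recursive expansion. The code below handles only the simple case in which all
listed items are required together, and it treats a circular reference as an
error on the assumption that such a reference cannot be resolved in terms of
physical resources. The helper resolve is hypothetical and is not part of
NetRAT.

```python
def resolve(name, requirements, _active=None):
    """Expand function names until only physical resources remain."""
    active = _active or set()
    if name not in requirements:  # not a function, so a physical resource
        return {name}
    if name in active:
        raise ValueError(f"circular requirement involving {name}")
    resources = set()
    for item in requirements[name]:
        resources |= resolve(item, requirements, active | {name})
    return resources


# The modified scenario from the text: FUN needs R1, R2 and FIND; FIND needs R3.
requirements = {"FUN": ["R1", "R2", "FIND"], "FIND": ["R3"]}
total = resolve("FUN", requirements)  # {"R1", "R2", "R3"}
```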

6.3.3  CONCLUSIONS 

In the system designers guidebook, we have presented a modeling and
analysis method to evaluate the reliability characteristics of systems. The
analysis is combinatorial and it is not easy to use manually. It is advised
that such procedures be automated as a general purpose tool. A system called
NetRAT, which is based on such a procedure, has been described in the
guidebook. The modeling approach used in NetRAT is network based, i.e., the
system is viewed as a collection of nodes connected by either bidirectional or
unidirectional links. Reliability characteristics of the individual links and
nodes are used to determine the reliability characteristics of the composite
system. This approach is particularly attractive from the viewpoint of
hierarchical analysis of systems.


6.4 Validation And Verification Techniques

In  this  section  we  describe  three  methods  for  proving  or  analyzing  the 
fault-tolerant  properties  of  distributed  system  designs.  The  detailed 
descriptions  of  these  methods  can  be  found  in  the  first  volume  of  the  system 
designers  guidebook.  The  first  method  is  based  on  applying  program 
verification  techniques  to  the  design  expressed  in  a  suitable  programming 
language.  This  method  is,  in  general,  expensive  in  terms  of  time  and  it 
necessarily  requires  support  of  automated  tools  during  the  verification 
process.  This  involves  proving  certain  formally  stated  properties  of  the 
software  system.  During  the  last  decade  a  considerable  amount  of  work  has 
been  done  in  the  area  of  developing  languages  and  their  support  tools  that 
facilitate  formal  verification  of  the  software.  The  most  notable  of  these 
systems are Affirm [GERH80], HDM [ROBI75], FDM [KEMM80], and Gypsy [GOOD78].
The  Gypsy  language  and  its  verification  system  have  been  designed  to 
facilitate  verification  of  communicating  processes.  This  makes  Gypsy  an 
attractive  candidate  in  this  category  of  methods.  In  this  section  we  present 
a  brief  description  of  the  application  of  the  Gypsy  methodology  to  proving  the 
fault-tolerant  characteristics  of  a  system  that  is  structured  according  to  the 
design  model  presented  in  Chapter  4.  Application  of  this  technique  is 
suitable  when  the  design  has  been  refined  and  specified  to  a  detailed  level. 
The  examples  presented  in  Chapter  10  of  the  guidebook  deal  with  the  recovery 
mechanisms  at  a  single  site. 

The second method for proving fault-tolerance of distributed system
designs is relatively less rigorous and is amenable to manual proofs for small
systems. Nevertheless, this method can be developed into a computer-assisted
system. In this approach we focus on proving properties of a set of
communicating processes. The system is abstracted as a collection of finite
state machines which interact by exchanging messages. The proofs are based on
the properties of the state sequences of these machines and the relationship
among the state sequences based on the communication events. Each finite
state machine is specified in terms of events and state transitions. Detailed
descriptions of the system in a programming language are not required at this
level; therefore, this method looks attractive at the higher-level design
phases such as the conceptual design or the functional architecture design.
An example dealing with the proof of a two-phase commit protocol is presented
in the guidebook.

The last method of design validation for fault-tolerance is based on
functional simulation of the design. In this method, to validate certain
recovery characteristics of a system, simulation models of the appropriate
parts of the design are constructed in a suitable simulation language such as
Path Pascal, Simula or PAWS. The simulation models mimic the functional
behavior of the actual system as intended by the design. Some of the basic
issues involved in simulating the fault-tolerance characteristics of a design,
the requirements on the simulation language for this purpose, and the salient
features of a Path Pascal simulator for the Process/Transaction Manager in the
Zeus system are discussed in the guidebook. This approach is expensive in
terms of time and effort because it requires building exact simulation models
of the system components.

6.4.1  Proofs  of  Recovery  Mechanisms  using  Gypsy 

6.4.1.1 Introduction

Gypsy  is  a  mature  methodology  for  constructing  formal  proofs  that  a 
software system satisfies formal specifications [GOOD78, GOOD82a, GOOD82b].
Gypsy  has  been  applied  very  successfully  in  several  security  applications,  but 
no  attempt  has  been  made  to  apply  Gypsy  to  recovery  problems.  The  focus  in 
this  effort  has  been  to  answer  the  question,  "What  can  be  specified  and  proved 
about  recovery  mechanisms  with  the  existing  Gypsy  methodology?" 

The designers guidebook describes two examples based upon work described in
Chapter 4. The first, a set of three examples, illustrates different Gypsy
implementations of recoverable objects. A generic shell formally specifying
the  behavior  of  recoverable  objects  is  given,  and  then  three  different 
implementations  are  shown  to  satisfy  the  formal  specifications  of  the  shell. 
The  second  example  is  a  Gypsy  model  of  a  transaction  recovery  scenario  given 
in  Chapter  4.  The  recovery  scenario  is  modeled  in  Gypsy,  and  formally 
specified.  This  example  supports  the  position  that  the  precision  required  to 
write  formal  specifications  has  the  potential  to  contribute  significantly  to 
the  quality  of  the  resulting  design. 

The intent of the recovery mechanisms considered in the effort related to
formal verification of recovery mechanisms is to provide the necessary support
for the implementation of atomic transactions. An atomic transaction is an
operation with the property that if it fails, the data objects that it was
altering are restored to the values that they had when the transaction first
accessed them. In a distributed system there are the added complications of
multiple copies of data objects which must be kept synchronized, transactions
which may be spread over multiple machines, and host crashes and message
transmission errors.
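
The restore-on-failure property of an atomic transaction can be illustrated
with a small single-site sketch. It is written in Python rather than Gypsy,
it is not one of the three implementations proven in the guidebook, and it
ignores concurrency, replication, and crashes. The class and method names are
hypothetical.

```python
class RecoverableObject:
    """Holds a value and can undo the changes of a failed transaction."""

    def __init__(self, value):
        self.value = value
        self._saved = {}  # transaction id -> value at first access

    def write(self, txn, new_value):
        # Remember the value only the first time this transaction touches it.
        self._saved.setdefault(txn, self.value)
        self.value = new_value

    def commit(self, txn):
        # Changes become permanent; the saved value is no longer needed.
        self._saved.pop(txn, None)

    def abort(self, txn):
        # Restore the value the object had when txn first accessed it.
        if txn in self._saved:
            self.value = self._saved.pop(txn)


account = RecoverableObject(10)
account.write("t1", 20)
account.write("t1", 30)
account.abort("t1")  # account.value is 10 again
```

A formal specification of such an object would state the abort property as an
assertion relating the value before the first write to the value after abort.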

The  process  of  developing  formally  specified  code  is  not  radically 
different  from  the  standard  cycle  of  software  development.  The  critical 
difference  lies  in  the  use  of  formal  specifications.  Because  they  are  so 
precise,  formal  specifications  are  often  more  difficult  to  write  than  an 
English statement of a functional specification. However, due to this enforced
precision, the resulting specifications are considerably more useful.
Additionally,  the  requirement  that  the  code  be  proven  to  coincide  with  the 
specifications  provides  a  tremendous  increase  in  confidence  in  the  resulting 
software. 

One point must be emphasized. Formal specification and proof do not
guarantee  that  there  will  be  no  errors  in  the  code.  It  is  possible  that  the 
specifications  do  not  capture  the  designer's  intentions.  It  is  also  possible 
to  specify  only  part  of  the  functionality  of  a  program,  in  which  case  the 
unspecified  portions  of  the  program  may  go  wrong.  What  the  verification 
process  does  assure  us  is  that  the  specification  and  the  code  are  consistent 
with  each  other. 


6.4.1.2 Gypsy Support for the Specification of Recovery Mechanisms

Gypsy provides a number of mechanisms that support the verification of
recovery mechanisms. There are two basic approaches that can be taken, which
are reflected in the two examples in this section. One is an object oriented
approach, which makes use of Gypsy's standard specification methods, in this
case lemmas to algebraically specify object properties and routine
specifications to specify the effects of operations on objects. Gypsy's
abstract data type facility could also be used effectively for these
examples. In this approach one can describe the required properties of the
selected objects and then demonstrate that the proper selection of procedures
to manipulate these objects maintains this set of specified properties.

The  other  approach  is  to  develop  a  procedural  model  that  takes  advantage  of 
Gypsy's  concurrency  mechanisms  to  simulate  the  distributed  world,  with  buffer 
operations  to  carry  message  traffic.  Buffer  histories  are  used  to  specify 
such  systems. 

These  two  methods  are  complementary.  On  the  one  hand  the  object  oriented 
specifications  provide  a  mechanism  to  specify  the  properties  that  recoverable 
objects  must  have.  A  procedural  model  then  permits  the  verification  that  the 
procedures  designed  to  maintain  these  objects  in  a  proper  state  function  as 
intended. 

6.4.1.3 Specifications and Proofs of Recoverable Objects

The example presented in the guidebook is chosen from the design model
for reliable distributed systems presented in Chapter 4. We have chosen the
recoverable object level of the model, which is built upon stable objects, and
supports atomic transactions. In this example the stable objects are of
arbitrary type (left "pending" in the Gypsy notation).

We do not concern ourselves with the issues of security and access control
identified in the model. The problems of transaction and process management
would be dealt with outside of the portions of the system modelled here. We
also do not consider the problems of implementing stable objects, but
construct recoverable objects out of a data type, which is an abstraction that
is left pending.

First, formal specifications of recoverable objects are given. Then three
different implementations of recoverable objects are presented, and proven to
meet the required specifications. Finally, we give two examples of how this
model might be extended to cover two more abstraction layers of Figure 4-2.
The effects of incorporating the stable object layer beneath the recoverable
object layer on the proofs are discussed in detail in the guidebook. The
description of recoverable objects can be used as the basis for the next higher
level handling of atomic transactions. The example builds a small type
manager that employs a simple locking protocol based on the recoverable object
specification.

6.4.1.4 Recovery Scenario for Atomic Actions

The study of applying Gypsy methodology to proving atomicity of
transactions in the example system demonstrates the utility of modeling system
designs in Gypsy. Even in the absence of proof, the need to write
specifications precise enough to support critical inspection forces a detailed
examination of assumptions. While this is a subjective process, as opposed to
the objective nature of the proof process, it increases the likelihood that
the coverage of the specifications is sufficient to describe the behavior of
the system under all cases included in the top level specification. In other
words, the specifications on the various components of the system are likely
to support our expectations (as embodied in a top level specification) about
the system as a whole.

6.4.1.5 Summary

Based  on  this  experience,  we  offer  the  following  observations. 

1. Various aspects of distributed command and control systems can reasonably
be described (formally specified and implemented) in Gypsy, and the
implementation verified against the formal specification. The Gypsy
implementations may well serve only as models for actual implementations
in other languages, but such efforts should significantly increase
confidence in the correctness of the resulting code.

2. Some elements of these systems do not map directly into Gypsy. For
example, the notion of spawning a process is not supported by the Gypsy
model of process invocation. Thus, some pieces of the system design can
only be modelled in Gypsy in a fashion quite different from the intended
implementation. We believe, based on our own experience, that composing
these models, and formally specifying and verifying them, can have
considerable benefit in enhancing the designer's understanding of the
system, and the level of precision in the system description. Even in the
absence of formal proof, the additional precision supplied by formal
specification can be of utility.

6.4.2 Recovery Mechanism Proofs using Interval Logic

Verification techniques based on analysis of input and output message
streams [MISRA81] and message buffers [GOOD79] suffice for establishing
"black-box" stimulus-response behavior of a process or network of
communicating processes. However, an important class of properties
(relationships among system state variables) cannot be as easily expressed
and verified with these methods. We would like to be able to combine
assertions over the states of several processes, so-called "local"
assertions, into a system-wide "global" assertion stating a relationship
among the variables of the several processes. The difficulty is that such an
assertion is intended to hold at some particular "time", i.e., point in the
history of the system, and this requires a rough synchronization or
"lining-up" in time of the various processes. If the assertions are construed
as holding at some instant of time, then it is required that the processes be
precisely synchronized so that at that instant all the related variables are
stable and in the desired relationship. This precise synchronization is
difficult to verify and can be expensive for the system to arrange.

The primary contribution of this work is the development of a framework
that facilitates construction of global assertions from local assertions. The
following section presents, in an informal fashion, the approach for
constructing global assertions about the communicating processes in a
distributed system from the local assertions of individual processes. This
approach examines the behavior of such processes over certain intervals,
establishes relationships among the intervals, and then derives global
assertions using these relationships. Because the process behavior is
described over intervals, we find the use of temporal logic notation [MOSZ83,
HALP83] convenient in such proofs.

6.4.2.1 Proofs of Global Assertions

The  approach  presented  here  is  intuitively  quite  simple.  In  this  method 
each communicating process is viewed as a finite state machine. The state
transitions in such a finite state machine occur due to either internal
or external events. The external events correspond to the arrival of a
message  from  some  other  process.  A  process  in  a  given  state  maintains  certain 
assertions  over  its  variables.  We  refer  to  such  assertions  as  the  local 
assertions.  The  occurrence  of  some  event  may  cause  a  process  to  enter  a  new 
state;  during  this  state  transition,  the  process  may  execute  certain  actions 
which  lead  to  new  events  in  the  system.  Some  of  these  actions  -  those  which 
send  messages  to  other  processes  -  may  cause  occurrence  of  events  in  some 
remote  processes. 

A sequence of state transitions in a process can be represented as a
sequence of states. Such a state sequence also represents the behavior of the
process over some interval in that process's life-time. If a local assertion
holds true in each state of a state sequence, then we say that the assertion
holds over the state sequence. This leads to another way of characterizing
intervals in a process's life-time: as the state sequences over which certain
assertions are maintained. Therefore, in the rest of this report we use the
term interval to characterize those state sequences which maintain some
assertion. In the case of finite state machines, state sequences of interest can
be identified by constructing the regular expressions for the machine;
interested readers should consult a textbook on automata theory for
the construction. These regular expressions also define the reachability sets for
the states.

An important step in deriving global assertions on the basis of the local
assertions of individual processes over their state sequences is to establish
relationships among those state sequences (or intervals). These
relationships define whether an interval precedes or is contained in some other
interval. Such relationships are established on the basis of the
communication events among the processes. Using a partial ordering model for
events in a distributed system, such as the one presented in [GREI78], one can
establish precedence and containment relationships among the local intervals
(state sequences) of the various processes in a distributed system. An
interval I1 contains another interval I2 if, and only if, the first
event of I1 precedes the first event of I2, and the last event of I2 precedes
the last event of I1.

A global assertion in a distributed system relates local variables of
several processes over some intervals in the life-time of that system. The
important step is the conjunction of several local assertions over the same
interval. Suppose that an assertion p holds true for the local variables of
some process P during some local interval I1, and an assertion q holds true
for the local variables of some process Q during some local interval I2. Now
suppose that I2 is contained in I1. This means that during the interval I2
(which can be viewed as a global interval for P) the assertion p is true for
process P. Therefore, the assertion (p and q) is true for the set of
processes P and Q during the interval I2, since every state of P that is
concurrent with I2 lies within I1.
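
The containment argument above can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the original method description: the use of vector timestamps to realize the partial order on events, the names `precedes`, `Interval`, `contains`, `holds_over`, and `global_assertion_holds`, and the example processes with their variables are all introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

VectorTime = Tuple[int, ...]      # one logical-clock entry per process (assumed)
State = Dict[str, int]


def precedes(a: VectorTime, b: VectorTime) -> bool:
    """True if the event stamped `a` precedes the event stamped `b`."""
    return a != b and all(x <= y for x, y in zip(a, b))


@dataclass
class Interval:
    first: VectorTime             # timestamp of the interval's first event
    last: VectorTime              # timestamp of the interval's last event
    states: List[State]           # local states the process passes through


def contains(outer: Interval, inner: Interval) -> bool:
    """`inner` is contained in `outer`: outer starts first and ends last."""
    return precedes(outer.first, inner.first) and precedes(inner.last, outer.last)


def holds_over(assertion: Callable[[State], bool], interval: Interval) -> bool:
    """A local assertion holds over an interval if it holds in every state."""
    return all(assertion(s) for s in interval.states)


def global_assertion_holds(p, i1: Interval, q, i2: Interval) -> bool:
    """(p and q) holds during i2 when p holds over i1, q over i2, and i2 is in i1."""
    return holds_over(p, i1) and holds_over(q, i2) and contains(i1, i2)


# Hypothetical example: P (index 0) sends a request and waits for the reply;
# Q (index 1) services the request between receiving it and replying.
i1 = Interval(first=(1, 0), last=(2, 3), states=[{"waiting": 1}, {"waiting": 1}])
i2 = Interval(first=(1, 1), last=(1, 3), states=[{"busy": 1}, {"busy": 1}])
p = lambda s: s["waiting"] == 1
q = lambda s: s["busy"] == 1
result = global_assertion_holds(p, i1, q, i2)
```

In this example Q's interval begins when Q receives P's request and ends when Q sends its reply, so it lies inside P's waiting interval, and (p and q) can be asserted during Q's interval.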

The  method  for  proving  global  assertions  in  this  approach  is 
schematically  shown  in  Figure  6-1.  The  partial  order  relation  between  events 
defines  intervals  and  the  containment  relationship  between  intervals.  A  local 
assertion  for  a  process  during  an  interval  is  derived  from  the  set  of 
reachable  states  during  this  interval.  The  reachability  set  during  an 
interval  is  computed  from  the  initial  state  during  the  interval  and  the  state 
transition  specifications  along  with  the  events  that  can  possibly  occur  during 
this interval.

[Figure 6-5. A Schematic Representation of the Proof Method Using Intervals. Labels recoverable from the figure: partial order relation; reachability of states.]


The first step in the method is to specify each process as a finite state
machine. This requires definition of the states and the state transitions
under various events. The set of events also includes communication
events, such as sending or receiving a message. For each process, based on
its finite state machine description, the regular expressions are constructed.
These regular expressions are used for reachability analysis during the
proofs. The next step involves identification of the intervals of interest
for which certain properties are to be proved; this requires a clear
understanding of the problem. These intervals are in general subsequences of
the regular expressions for the finite state machine. Relating the intervals
of different processes for global reasoning is done on the basis of the
communication events.
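
The reachability step can be sketched as follows. This is an illustrative sketch only: it computes the reachability set by a breadth-first search over the transition table rather than by constructing regular expressions, and the commit-participant state machine, its event names, and the function `reachable_states` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def reachable_states(transitions: Dict[Tuple[str, str], str],
                     start: str,
                     allowed_events: Iterable[str]) -> Set[str]:
    """States reachable from `start` using only events allowed in the interval."""
    allowed = set(allowed_events)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for (src, event), dst in transitions.items():
            if src == state and event in allowed and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return seen


# Hypothetical finite state machine for a commit participant.
TRANSITIONS = {
    ("working", "recv_prepare"): "prepared",
    ("working", "local_abort"): "aborted",
    ("prepared", "recv_commit"): "committed",
    ("prepared", "recv_abort"): "aborted",
}

# Interval of interest: from the start of work until a decision message arrives,
# so the decision events cannot occur inside it.
interval_states = reachable_states(TRANSITIONS, "working", {"recv_prepare"})

# Local assertion derived from the reachability set: nothing is committed yet.
not_committed = all(s != "committed" for s in interval_states)
```

The local assertion is read off the set of states the process can occupy during the interval, given its initial state and the events that may occur in it.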


6.4.3  Functional  Simulation  of  Fault  Tolerance 

Functional  simulation  is  an  approach  for  validating  that  a  model  of  a 
software  system  exhibits  a  desired  property.  In  this  section  we  discuss  the 
use  of  functional  simulation  techniques  for  validating  that  a  software  system 
is  fault  tolerant.  The  discussion  is  based  on  our  experience  building  a  Path 
Pascal  model  that  simulates  a  subset  of  the  Zeus  process/transaction  manager 
and a subset of a generic object manager. The fault tolerance property that
is validated is that transactions provide an "all or nothing" effect even
if site crashes occur and messages are lost or duplicated.


Functional simulation is one of the few techniques that permits the early
examination of the behavior of a program. The costs associated with this
technique are directly related to where in the lifecycle the activity is
performed. The closer to implementation, the more detailed a model will be
and the more costly it is to develop, validate, and analyze, but the greater
the potential insights and benefits. Functional simulation is a form of
testing and has the same disadvantages as testing: it can show the
presence of anomalous behavior but cannot prove the absence of anomalous
behavior. A survey of different approaches to software verification and
validation and their strengths and weaknesses is given in [ADRI82].

Functional simulation uses an executable model to represent the behavior
of an object for the purpose of analyzing whether the object correctly
exhibits a desired property. Validation consists of observing the behavior
that a model of an object exhibits when executed on models of a computational
environment and an external environment, and analyzing that behavior with respect
to the desired behavior of the software system. For example, if the property
is security, the object modeled may be an operating system kernel. The
validation may consist of observing which requests are granted and which are denied
access; this can then be compared with what a model of the security property
defines as correct behavior.

6.4.3.1 Issues in Simulating Fault Tolerance

Simulating  fault  tolerance  and  distributed  systems  raises  a  number  of 
issues  which  impose  requirements  on  the  simulation  system  selected.  This 
section  discusses  some  of  the  technical  difficulties  that  we  have  encountered 
and their implication for different simulation systems. A discussion of
solutions to the problems presented in this section and of the actual model
is given in the system designers guidebook.

The  technical  difficulties  that  arise  are  directly  related  to  the 
property  that  is  to  be  validated  and  the  technique  that  is  to  be  used  for 
validation.  Clearly  validating  the  security  of  a  design  will  involve 
different  modeling  issues  and  validation  techniques.  The  following  discussion 
is  restricted  to  fault  tolerance  and  specifically  to  validating  the  atomicity 
of  transactions,  although  some  of  the  discussion  is  relevant  to  the  modeling 
and  validation  of  properties  in  general. 

The key events to be modeled are failures, so it makes sense to examine
what their impact is on a model. There are a number of failures that are of
interest. Among the kinds of failures are site crashes, memory failures (both
primary and secondary), link failures, and lost and duplicate messages. The
requirements that failures impose on a model may be analyzed by examining
their effect on a computation.

A  site  crash  results  in  all  active  computations  halting  at  the  same  time 
in  an  unknown  state.  A  model  must  be  able  to  represent  multiple  concurrent 
activities  and  control  their  progress.  The  multiple  concurrent  activities 
model  a  system  executing  a  certain  number  of  user  processes  (e.g.,  equal  to 
the  desired  multiprogramming  level)  and  system  processes.  Controlling  the 
computation's  progress  includes  stopping  a  process  when  an  event  occurs  and 
continuing the process when some other event occurs. There are two types of
events of interest: resource coordination and processor failures. Resource
coordination may occur if multiple processes are sharing a resource (such as a
processor or a file) or are cooperating to complete a computation (such as a
buffer for a producer and consumer). Event coordination may be achieved by a
synchronization object and mechanisms (e.g., a semaphore and the ability to
allocate it and block processes). Processor failures require the ability to
stop all processes simultaneously and to cause them to make a transition to a
well defined next state.

It  is  not  acceptable  to  put  the  burden  of  event  management  in  the 
application  and  system  processes.  A  process  should  not  check  the  state  of  a 
resource  each  time  that  it  accesses  it,  and  it  should  not  check  to  see  if  the 
processor  is  active  before  it  executes  an  instruction.  Ideally,  an 
application  process  should  request  a  resource  and  any  synchronization  should 
happen  as  a  side  effect.  Similarly,  if  a  site  crash  occurs  all  processes 
should be halted as a side effect of the failure and not due to the processes
checking the processor state.
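
A minimal Python sketch of this requirement follows. It is not the Path Pascal model described in the guidebook; the generator-based processes, the round-robin scheduler, and the names `Site`, `worker`, and `run` are assumptions made for illustration. The point it shows is that the crash is applied by the scheduler, so the modeled processes never test whether their processor is up.

```python
from typing import Generator, List, Tuple


class Site:
    """A modeled site; `up` is changed only by the scheduler, never by workers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.processes: List[Generator[None, None, None]] = []


def worker(log: List[Tuple[str, int]], name: str, steps: int):
    """A modeled process. Note that it never inspects the state of its site."""
    for i in range(steps):
        log.append((name, i))
        yield                      # return control to the scheduler


def run(site: Site, crash_at_tick: int, max_ticks: int) -> None:
    """Round-robin scheduler that injects a site crash at a chosen tick."""
    for tick in range(max_ticks):
        if tick == crash_at_tick:
            site.up = False
            for proc in site.processes:
                proc.close()       # every process on the site halts together
            site.processes.clear()
        if not site.up:
            break
        for proc in list(site.processes):
            try:
                next(proc)
            except StopIteration:
                site.processes.remove(proc)


log: List[Tuple[str, int]] = []
site = Site("TACC")
site.processes = [worker(log, "A", 5), worker(log, "B", 5)]
run(site, crash_at_tick=2, max_ticks=10)
```

Both workers complete two steps and are then halted at the same tick, with no crash-handling code in `worker` itself.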

Simply halting the progress of a computation is insufficient for
simulating a site failure. The state of a computation is divided between
secondary and primary storage. Almost all primary storage is volatile. Hence
when a site crash occurs, a certain amount of the state of a computation is
lost. This requires that a model of volatile primary storage exhibit the loss
of information. However, if a model simulates memory failures it still needs
to be able to simulate recovery, which requires starting a process at a defined
point with its computation in a state that is consistent with that point. It
seems as though a model must take snapshots (e.g., checkpoints) of a
computation's evolving state and correlate those with different points in the
computation's progress. Two problems arise: there may be an arbitrary number
of such points, and it is unclear how a modeler knows which ones to select.
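
The split between volatile and stable state can be sketched as below. The class `ModelProcess`, its fields, and the policy of restarting from the most recent checkpoint are assumptions for illustration; the sketch does not address the open question of which points to checkpoint.

```python
import copy
from typing import Any, Dict, List, Optional


class ModelProcess:
    """A modeled computation whose state is split between two kinds of storage."""

    def __init__(self) -> None:
        self.volatile: Optional[Dict[str, Any]] = {"step": 0, "total": 0}
        self.stable: List[Dict[str, Any]] = []     # survives a site crash

    def do_step(self, amount: int) -> None:
        self.volatile["step"] += 1
        self.volatile["total"] += amount

    def checkpoint(self) -> None:
        """Snapshot the state; the `step` field ties it to a point in progress."""
        self.stable.append(copy.deepcopy(self.volatile))

    def crash(self) -> None:
        self.volatile = None                       # primary storage is lost

    def recover(self) -> None:
        """Restart from the most recent checkpoint."""
        self.volatile = copy.deepcopy(self.stable[-1])


proc = ModelProcess()
proc.do_step(5)
proc.checkpoint()          # snapshot taken at step 1
proc.do_step(7)            # this work is not checkpointed
proc.crash()
proc.recover()
```

After recovery the process resumes at step 1 with the state it had at that point; the uncheckpointed second step is lost, as the text requires of a volatile-storage model.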

Memory failures may occur independently of site crashes. This results in
the  same  problems  as  above  but  with  the  added  difficulty  of  only  part  of  the 
state  of  a  computation  being  lost  (e.g.,  the  state  may  be  resident  in  multiple 
primary  memories). 

There  are  three  kinds  of  communication  failures  of  interest  —  lost 
messages,  duplicate  messages,  and  link  failures.  All  of  the  failures  change 
the  effects  of  operations  on  objects  of  type  message.  The  first  two  occur 
intermittently  and  affect  a  single  message.  The  last  one  occurs  for  a  time 
interval  that  encompasses  many  messages. 

A model of a failure should include the ability to do fault detection and
should not consume measurable resources or affect events that are independent
of it. Link failures and memory failures are examples of resources becoming
unavailable for a period of time. During that period the failures should be
detectable, for example by timeouts or parity checks. A model of a site
failure should not consume resources, so any computation done to simulate the
failure must not consume the modeled memory or CPU resources. Similarly the discarding of
messages to simulate the loss of a message should not impact the CPU
utilization measured.


Faults may be injected into a simulation either deterministically or
probabilistically. The deterministic approach requires an explicit statement
as to when a fault is introduced. It may be signaled either by an explicit
call or by a parameter within a call, or it may be triggered based on the state of
the system. For example, if 10 transactions have been successfully completed,
lose the prepare message issued by the transaction commit coordinator; or if
there is an object in the commit pending state, then crash the site.
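
The first of these state-triggered rules can be sketched in Python as follows. The class `DeterministicInjector`, its method names, and the choice to lose exactly one prepare message are hypothetical; only the trigger condition (10 completed transactions) comes from the example in the text.

```python
from typing import List


class DeterministicInjector:
    """Loses one prepare message once a given number of transactions commit."""

    def __init__(self, lose_prepare_after: int) -> None:
        self.lose_prepare_after = lose_prepare_after
        self.committed = 0
        self.fired = False          # the fault is injected exactly once

    def transaction_committed(self) -> None:
        self.committed += 1

    def deliver(self, message_kind: str) -> bool:
        """Return False when the message should be dropped by the model."""
        if (message_kind == "prepare" and not self.fired
                and self.committed >= self.lose_prepare_after):
            self.fired = True
            return False
        return True


injector = DeterministicInjector(lose_prepare_after=10)
outcomes: List[bool] = []
for _ in range(12):
    delivered = injector.deliver("prepare")
    outcomes.append(delivered)
    if delivered:
        injector.transaction_committed()   # simplification: delivery implies commit
```

The fault fires on the eleventh prepare message, after ten transactions have completed, and on no other message.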

The  probabilistic  approach  injects  faults  based  on  a  distribution  that  is 
independent  of  the  current  state  of  individual  computations.  The  signaling  of 
probabilistic  fault  injection  is  done  implicitly  by  a  routine  generating  a 
value  based  on  a  distribution  and  determining  whether  or  not  the  value 
generated  implies  that  a  fault  should  be  injected.  For  example,  a  network 
driver  routine  may  generate  a  value  based  on  a  uniform  distribution.  If  the 
value  falls  within  a  specified  range,  a  message  is  lost. 
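
The network driver example can be sketched as below. The class name `LossyNetworkDriver`, the seeded generator, and the loss probability of 0.1 are assumptions chosen for illustration; the text specifies only that a uniformly distributed value is compared against a range.

```python
import random
from typing import Any, List


class LossyNetworkDriver:
    """Drops each message independently with a fixed probability."""

    def __init__(self, loss_probability: float, seed: int = 0) -> None:
        self.loss_probability = loss_probability
        self.rng = random.Random(seed)     # seeded so that runs are repeatable
        self.delivered: List[Any] = []
        self.lost = 0

    def send(self, message: Any) -> bool:
        """Return True if the message is delivered, False if it is lost."""
        # random() is uniform on [0, 1); values below the threshold mean "lost".
        if self.rng.random() < self.loss_probability:
            self.lost += 1
            return False
        self.delivered.append(message)
        return True


driver = LossyNetworkDriver(loss_probability=0.1, seed=1)
for i in range(1000):
    driver.send(i)
```

The decision to lose a message does not depend on the state of any individual computation, which is what distinguishes this approach from the deterministic one.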

The  selection  of  which  approach  to  fault  injection  to  use  is  made  based 
on  what  is  to  be  learned  from  a  model  and  how  much  effort  is  to  be  spent 
building  and  analyzing  a  model.  The  two  approaches  must  be  matched  to  the  use 
of  the  model  and  the  kinds  of  faults  that  must  be  injected.  For  example,  if  a 
model  is  to  be  used  to  demonstrate  that  a  communication  subsystem  provides 
reliable  message  delivery,  faults  in  the  form  of  lost  and  duplicate  messages 
may  be  introduced  probabilistically.  Other  times  it  may  be  more  important  to 
see  the  impact  of  a  specific  set  of  events  on  an  operation,  for  example,  the 
effect  of  losing  a  specific  message,  such  as  a  commit  message,  on  the  timeout 
period  and  window  of  vulnerability  for  a  commit  protocol.  This  example 
requires  deterministic  fault  injection  to  ensure  a  specific  ordering  of 
events.  In  general,  probabilistic  fault  injection  is  easier  to  develop  and 
use within a model. However, probabilistic injection requires more runs to
ensure coverage of all possible event sequences. Hence it may be more
expensive in terms of time to run the simulation and time to analyze the
results. Deterministic fault injection is in general more expensive to
develop  because  faults  may  have  to  be  generated  based  on  the  local  state  of 
individual  computations.  It  will  result  in  the  generation  of  all  requested 
sequences  of  events  in  a  minimal  amount  of  simulation  time.  However,  the 
deterministic  approach  will  only  generate  those  sequences  of  events  desired. 
There  may  be  an  equally  important  sequence  of  events  that  is  not  generated 
because  the  analyst  has  not  specified  its  inclusion. 

6.4.3.2 Summary

The system designers guidebook demonstrates how failures in a distributed
environment may be modeled using Path Pascal, a process oriented simulation
language. It is useful to summarize Path Pascal's strengths and weaknesses in
terms of our previous discussion on the requirements for a simulation
language. Path Pascal does support multiple concurrent activities through its
"process" construct. This allows the modeling of multiple sites, each of
which has multiple applications executing concurrently. Path expressions
provide a means for controlling shared data. If the state of a process's
progress is a shared resource (e.g., encapsulated within an object), it may be
accessed by multiple processes. Further, if processes are divided into system
processes and application processes, each of which executes a disjoint set of
operations on the state information, the proper interleaving of the operations
will allow the progress of a process and its state changes to be controlled.
Because the language has the full descriptive capacity of a programming
language and because it is process-oriented, the intricacies and side effects
of a failure may be captured.

There are a number of deficiencies in the language for our purpose.
Processes may be easily created dynamically, but there is no language
construct for destroying them. Process destruction may be achieved by
manipulating the heap from which they are allocated, but this is tricky and is
discouraged. A desirable mechanism is one that simultaneously interrupts all
processes of a given class (e.g., those executing on a specified processor) in
order to simulate site failures. Unfortunately, there is no relation (e.g.,
hierarchy or classes) between processes, and there is no way of
instantaneously interrupting a process.

Path expressions are intended for controlling the access to shared data
by multiple processes. As such, path expressions are schedulers. However, it
is difficult to use path expressions for scheduling processes for certain
types of condition synchronization [ANDR83]. The way that the state of sites
is disseminated when a failure occurs demonstrates one kind of condition
synchronization between processes. However, often one may wish to express the
states a process can go through as a form of condition synchronization with
itself. For example, an application process invoking a transaction has the
following specification: execute begin transaction once, followed by some
number i, 0 <= i <= n, of object operations, and concluded with one abort or
one end transaction. Path expressions cannot solve this problem; ad hoc
solutions are required.
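
For illustration, the transaction specification above can be checked against a recorded call trace with a small Python function. This is not a Path Pascal solution and not the ad hoc solution used in the model; the call names and the function `valid_transaction_trace` are hypothetical.

```python
import re
from typing import Sequence

# One-letter codes for the calls an application process may make (assumed names).
CODES = {"begin": "b", "op": "o", "end": "e", "abort": "a"}


def valid_transaction_trace(calls: Sequence[str], n: int) -> bool:
    """Check: one begin, then i object operations (0 <= i <= n), then end or abort."""
    encoded = "".join(CODES.get(call, "?") for call in calls)
    pattern = r"bo{0,%d}[ea]" % n
    return re.fullmatch(pattern, encoded) is not None


ok_commit = valid_transaction_trace(["begin", "op", "op", "end"], n=3)
ok_abort = valid_transaction_trace(["begin", "abort"], n=3)
too_many = valid_transaction_trace(["begin"] + ["op"] * 4 + ["end"], n=3)
no_begin = valid_transaction_trace(["op", "end"], n=3)
```

The first two traces satisfy the specification; the last two violate the bound on object operations and the requirement to begin the transaction, respectively.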

PERFORMANCE EVALUATION OF THE ZEUS SYSTEM


This chapter describes the approach followed in the performance
evaluation of the Zeus system. The goals of this performance evaluation work
were to:

(1) Develop and illustrate modeling of recovery mechanisms and faults in
distributed systems;

(2) Illustrate how to measure the differential cost of introducing recovery
mechanisms into distributed system designs. The performance evaluations
focus on measuring the differential cost of introducing recovery
mechanisms in terms of degradation in response times and throughputs of
various job classes;

(3) Illustrate approaches for comparing various design options of a recovery
mechanism for an application environment's fault characteristics and then
selecting an option based on the results of such comparisons;

(4) Evaluate some commit protocols under various workloads and fault
characteristics of the operating environment.

The  guidebook  presents  a  detailed  description  of  how  to  model  failures 
and  recovery  mechanisms  in  a  distributed  system  using  PAWS  and  Path  Pascal. 
The  kinds  of  failures  considered  are  site  crashes,  disk  crashes,  link 
failures, message loss, and duplication of messages. The recovery mechanisms
considered  are  commit  protocols,  reliable  remote  procedure  calls,  atomic 
actions,  stable  storage,  careful  replacement,  object  replication, 
checkpointing,  and  rollback. 

The overhead introduced by a recovery mechanism is an important
evaluation criterion. To measure it, the simulation models are executed in
three configurations: first with neither the recovery mechanisms nor the
system failures, next with the recovery mechanisms but without system
failures, and finally with both the recovery mechanisms and the failures.
The degradation of performance from the first to the second
evaluation indicates the overhead that the recovery mechanisms impose
during normal operation. The degradation
of performance from the second to the third evaluation indicates the
overhead of taking recovery actions in response to the failure conditions introduced
in the system. This depends on the rate of fault injection.
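
The two differential costs can be computed as in the following sketch. The function names and the response times used in the example are hypothetical and are not measurements from the Zeus model; degradation is expressed here as a percentage increase in response time, which is one of several reasonable definitions.

```python
from typing import Dict


def percent_degradation(baseline: float, measured: float) -> float:
    """Percentage increase in response time relative to a baseline run."""
    return 100.0 * (measured - baseline) / baseline


def overheads(no_recovery: float,
              recovery_no_faults: float,
              recovery_with_faults: float) -> Dict[str, float]:
    """Differential costs from the three model executions described above."""
    return {
        # first -> second run: cost of the mechanism during normal operation
        "normal_operation": percent_degradation(no_recovery, recovery_no_faults),
        # second -> third run: cost of recovery actions caused by failures
        "recovery_actions": percent_degradation(recovery_no_faults,
                                                recovery_with_faults),
    }


# Made-up mean response times (seconds) for one job class.
example = overheads(no_recovery=2.0, recovery_no_faults=2.5,
                    recovery_with_faults=3.0)
```

With these illustrative inputs the mechanism costs 25 percent in normal operation and recovery actions cost a further 20 percent at the chosen fault injection rate.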

The comparative evaluation of various design options of a recovery
mechanism is important in selecting one of the options for a system design
depending on the failure characteristics of its application environment and
the characteristics of the jobs executed by the system. There are two parts
of the comparative evaluation. The first part consists of evaluating each
design option for a given set of failure rates in the system, starting with the
no-failure case. The second part consists of determining the effect of these
options on different job categories in the system. In the next section we
describe a set of generic job categories in the system. The models are
executed with a workload consisting of a variety of such jobs. The
performance measures are collected for each job class. This helps in
determining which job classes are more sensitive to the various design
options and under what kind of failure environments. Thus, given a certain
application system along with its job mix characteristics and failure
environment, the designer can determine which option is most suitable for
implementation.

As an example to illustrate these ideas, we have compared
the Presumed Abort and Presumed Commit protocols, and the one-phase and two-phase
commit options. These evaluation results are presented in the last section of
this chapter.

7.1  Model  Overview 

As described in the previous chapter, there are three components of a
performance model: environment, system structure, and workload. The
environment captures the standard hardware and software as well as the effect
of the physical environment. The Zeus environment included the following:
configuration of the system, the performance attributes of its components, and
operational conditions such as failures and their rates. The system structure
captures the architecture of the parts of the model, both hardware and
software, that are being analyzed. For example, the Zeus object managers with
their consistency and recovery mechanisms are part of the system structure, as
are the command and control object managers. Finally, the workload captures
the pattern and frequency of usage of various resources in the system as
derived from the execution of the application systems. This includes the
definition of the classes of command and control jobs. The components of the
model used for this effort are described below.

7.1.1  Model  Environment 

The model's environment consisted of the system configuration and fault
injection function. The system configuration consisted of seven sites
interconnected by a local area network. Rather than model the intricacies of
transmission on a local network (e.g., link control, medium access control,
physical control, etc.), a delay with an exponential distribution that
approximated the time of an end-to-end message was used. The hardware
configuration at each site was identical. It consisted of one CPU and five
disks. Two of the disks were configured to be a stable disk (i.e., the
contents of the two physical disks were identical). A central server model
was used, with a process requesting CPU usage and then disk usage.
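
The network abstraction can be sketched as a single exponentially distributed delay per message. The function name `message_delay`, the seed, and the mean delay of 2.0 time units are assumptions; the report does not state the mean used in the Zeus model.

```python
import random
from typing import List


def message_delay(rng: random.Random, mean_delay: float) -> float:
    """Sample an end-to-end message delay from an exponential distribution."""
    # expovariate takes the rate parameter, which is 1 / mean.
    return rng.expovariate(1.0 / mean_delay)


rng = random.Random(42)                    # seeded so that runs are repeatable
samples: List[float] = [message_delay(rng, mean_delay=2.0) for _ in range(10_000)]
average = sum(samples) / len(samples)
```

Over many samples the average delay approaches the chosen mean, so one parameter stands in for the link control, medium access control, and physical layers that are not modeled.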




All failures were assumed to be clean. The types of faults injected were
site crashes and disk crashes. It was assumed that a disk crash caused the storage
medium to be corrupted and resulted in the database being reconstructed using
some combination of restoration from an archival version and processing based
on a log. The failure rate was varied from no failures, to a few failures,
and finally to an increase of a couple of orders of magnitude in the number of
failures. This provided a base case of operation in a fault free environment
to compare with the expected case of a few faults and an extreme case of many
faults.

7.1.2  Model  System  Structure 

The  system  structure  consisted  of  a  number  of  object  managers  assigned  to 
different  sites  and  a  number  of  transactions  that  could  be  initiated  from 
different  sites.  Each  site  had  a  process/transaction  manager  to  handle 
operations  such  as  begin  transaction,  end  transaction,  and  abort  transaction. 
The  object  managers  that  performed  C2  operations  were  instantiations  of  the 
generic  object  manager  for  exemplary  C2  objects.  Each  object  manager 
performed  operations  such  as  concurrency  control,  commit  processing,  and 
transaction  undo.  The  amount  of  time  to  do  an  object  manager  specific 
operation  was  determined  based  on  the  number  of  objects  accessed  by  an 
operation  (see  workload  discussion).  In  addition,  each  site  had  the 
equivalent  of  the  operating  system  support  for  providing  transparency  of 
object  and  object  manager  location.  The  details  of  these  object  managers  and 
functions  are  contained  in  the  appendix  of  the  system  designers  guidebook. 

Table  7-1  describes  the  configuration  of  the  C2  object  managers  for  the 
performance  evaluation.  For  each  object  manager  the  following  information  is 
listed: the name of the object manager (e.g., the type of object managed), the
number  of  instances  of  objects,  the  sites  where  an  instance  of  the  object 
manager  exists,  and  whether  or  not  instances  of  objects  that  are  managed  at 
multiple  sites  are  copies  that  are  maintained  with  strong  consistency.  The 
names  of  the  sites  have  the  following  meaning:  TACC  is  tactical  air  command 
center,  CRC  is  control  reporting  center,  AS1  is  air  squadron  1,  and  AS2  is  air 
squadron  2.  There  are  three  additional  sites  that  have  none  of  the  listed  C2 
object  managers.  They  are  FACP1,  FACP2,  and  FACP3  (forward  area  control 
posts).  Transactions  may  originate  from  any  of  the  sites. 

Table 7-1. C2 Object Manager Configuration

Object Manager   Number Objects   Sites                  Replicated
Intelligence     80               TACC, CRC              yes
Navigation       80               TACC, CRC              yes
Supplies         80               AS1, AS2               no
Mission Plans    120              TACC, CRC, AS1, AS2    no
Squadron         40               TACC, AS1, AS2         no
Weather          80               TACC, CRC              no

7.1.3  Model  Workload 

In evaluating the performance of the Zeus design, we were interested in
examining the characteristics of the system with a wide variety of different
job types. We were particularly interested in the effect of the different
recovery mechanism design options on the performance of different job classes.
However, since no application software was available, a number of generic
scenarios were defined in terms of several performance-affecting attributes.
Four job attributes and two possible values for each attribute were defined as
follows:

o Duration (short or long) - The total number of type manager operations
executed by a scenario. The difference between "short" and "long"
scenarios is about an order of magnitude.

o Number of Objects Accessed (few or many) - The total number of objects
which are either read, written, or read and written by a scenario. The
difference between "few" and "many" object accesses is about
an order of magnitude.

o R/W Ratio (R/O or update) - This indicates whether or not the scenario does
any update operations on ANY of the objects that it accesses. R/O jobs do
not do any update operations whereas "update" jobs do at least one.

o Object Distribution (single- or multi-site) - If all the objects accessed
by a job reside on a single host (not necessarily the same one that the
scenario is running on), then the value of this attribute is "single-site";
otherwise, it is "multi-site."

Of the sixteen possible generic job classes that may be obtained by
substituting values for these four attributes, eight were chosen based on
information about existing C2 applications. Table 7-2 summarizes the
attributes of these eight jobs and defines an instruction mix distribution for
them. The percentage figures in the job mix column indicate the percentage of
the total number of jobs resident in the system at any given time (after a
steady state has been reached).


Table  7-2.  Job  Mix  Description 

Job      Job       Duration   # Objects   R/W      Object
Number   Mix (%)              Accessed    Ratio    Distrib.
------   -------   --------   ---------   ------   -----------
1        15        short      few         R/O      multi-site
2        15        short      few         R/O      single-site
3        15        short      few         update   single-site
4        15        short      few         update   multi-site
5        20        short      many        update   multi-site
6        5         long       few         update   multi-site
7        10        long       few         update   single-site
8        5         long       many        update   multi-site

By way of justification for the job mix given here, notice the following:

o  80% of the concurrently executing jobs are short; that is, they perform
relatively few operations.

o  75% of the jobs have a relatively small working set (they access only a
few objects).

o  Many of the jobs (30%) are read-only.
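
The percentages quoted above can be checked directly against Table 7-2. The
following Python sketch transcribes the table and recomputes the three shares.
The names JobClass, JOB_MIX, and share are illustrative only and do not appear
in the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobClass:
    number: int
    mix_percent: int   # share of jobs resident in the system at steady state
    duration: str      # "short" or "long"
    objects: str       # "few" or "many"
    rw: str            # "R/O" or "update"
    distribution: str  # "single-site" or "multi-site"


# Rows transcribed from Table 7-2.
JOB_MIX = [
    JobClass(1, 15, "short", "few", "R/O", "multi-site"),
    JobClass(2, 15, "short", "few", "R/O", "single-site"),
    JobClass(3, 15, "short", "few", "update", "single-site"),
    JobClass(4, 15, "short", "few", "update", "multi-site"),
    JobClass(5, 20, "short", "many", "update", "multi-site"),
    JobClass(6, 5, "long", "few", "update", "multi-site"),
    JobClass(7, 10, "long", "few", "update", "single-site"),
    JobClass(8, 5, "long", "many", "update", "multi-site"),
]


def share(predicate):
    """Sum the mix percentages of all job classes matching the predicate."""
    return sum(job.mix_percent for job in JOB_MIX if predicate(job))


total = share(lambda job: True)
short_share = share(lambda job: job.duration == "short")
few_objects_share = share(lambda job: job.objects == "few")
read_only_share = share(lambda job: job.rw == "R/O")
```

Under this transcription the mix sums to 100%, and the short, few-object, and
read-only shares come out to 80%, 75%, and 30% respectively.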


Although this method of selecting an example workload may seem somewhat
ad hoc, it is expected to provide valuable performance data that are
sufficiently accurate to guide the design process. More importantly for our
present purposes, it provides a concrete example of the use of instruction
mixes in real design situations.

For each job description in the mix, a number of exemplary jobs were
required. Synthetic jobs were used since we were modeling a pre-operational
system. These are artificial jobs whose resource usage is similar to the
expected characteristics of some future real jobs or to existing jobs being
run on other systems.

Synthetic jobs are easier to obtain than the real applications because
they summarize the resource utilization of the jobs that they are
characterizing. Local CPU usage, for example, is usually represented by a
simple idle loop in a synthetic job, and remote requests are abstracted so
that the parameters of the calls are simplified or left out entirely.

As an example of a synthetic user-level scenario from the Zeus
performance analysis, consider the following job:

Begin Scenario
    Read Intelligence
    Read Navigation
    Read Weather
    Read Mission-Plan
    Computation
    Update Mission-Plan
End Scenario

It is assumed that "Intelligence", "Navigation", "Weather", and "Mission-Plan"
are objects (or groups of objects) in the command and control system. The
semantics of the scenario is meant to resemble a so-called "Mission Control"
job, which is a typical (although much simplified) job being run on other C2
systems. The scenario reads the appropriate data, does some local computation
to determine how the plan of some in-progress mission should be changed, and
then updates that mission plan.
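
As a rough illustration, the scenario above could be written down as data.
This is a minimal Python sketch, not the notation used in the Zeus model. The
names Op, MISSION_CONTROL, classify, and busy_loop are hypothetical, and the
counting loop merely stands in for the unspecified local computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str         # "read", "update", or "compute"
    target: str = ""  # object (or object group) name; empty for computation


# The "Mission Control" scenario, one entry per scenario step.
MISSION_CONTROL = [
    Op("read", "Intelligence"),
    Op("read", "Navigation"),
    Op("read", "Weather"),
    Op("read", "Mission-Plan"),
    Op("compute"),
    Op("update", "Mission-Plan"),
]


def classify(job):
    """Derive two workload attributes from a job's operation list."""
    targets = {op.target for op in job if op.kind != "compute"}
    has_update = any(op.kind == "update" for op in job)
    return {
        "targets_accessed": len(targets),
        "rw_ratio": "update" if has_update else "R/O",
    }


def busy_loop(iterations):
    """Stand-in for local CPU usage: a simple counting loop."""
    count = 0
    for _ in range(iterations):
        count += 1
    return count


attributes = classify(MISSION_CONTROL)
```

Here classify reports four distinct targets and an "update" job, since the
scenario ends by writing the mission plan.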

Notice how this synthetic job represents the basic performance-affecting
features of the scenario in a very stylized and simplified way. The meaning
and functionality of the local computation are not specified, and neither are
the parameters of the remote calls. A complete description of the jobs is
contained in the appendix of the system designers guidebook.


7.2  Goals of the Example Evaluation

The example evaluation is performed with the objective of evaluating
certain design options for commit protocols. Specifically, in this evaluation
we investigate the Presumed Commit vs. Presumed Abort option, and the one
phase commit vs. two phase commit option.

The Presumed Abort protocol implies that in the absence of any information
about a transaction's commitment at its coordinator, it is presumed that the
transaction was aborted. This means that when a transaction commits, the
coordinator must keep the transaction's commit status information until it
is certain that no further status queries for that transaction will be
received. Analogously, the Presumed Commit protocol implies that in the
absence of commit information, a transaction is presumed to have committed.
Thus, when a transaction is aborted, the coordinator must maintain its abort
status until it is certain that no more status queries for that transaction
will be received. In a fault-free environment where most transactions commit,
the Presumed Commit protocol looks more attractive; it avoids synchronous
disk writes for a large number of transactions. On the other hand, if a
system encounters a large number of failures that cause the majority of
transactions to be aborted, the Presumed Abort protocol looks more
attractive. One can observe that as the failure rate in a system is
increased, starting with a fault-free system, there exists a crossover point
beyond which it is more advantageous to use the Presumed Abort protocol. This
crossover point can be obtained by executing the model with varying rates of
fault injection.
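
The bookkeeping difference between the two presumptions can be sketched in a
few lines of Python. This is an illustrative model under stated assumptions,
not the Zeus coordinator: the class Coordinator and its methods are
hypothetical names, and an in-memory dictionary stands in for the status
information that a real coordinator would keep on stable storage.

```python
from enum import Enum


class Outcome(Enum):
    COMMITTED = "committed"
    ABORTED = "aborted"


class Coordinator:
    """Keeps only the outcomes that differ from the protocol's presumption."""

    def __init__(self, presumption):
        self.presumption = presumption
        self.records = {}  # transaction id -> explicitly retained outcome

    def finish(self, txn_id, outcome):
        # An outcome equal to the presumption needs no record at all.
        if outcome is not self.presumption:
            self.records[txn_id] = outcome

    def query(self, txn_id):
        # Absence of information yields the presumed outcome.
        return self.records.get(txn_id, self.presumption)

    def forget(self, txn_id):
        # Only safe once no further status queries can arrive.
        self.records.pop(txn_id, None)


presumed_abort = Coordinator(Outcome.ABORTED)
presumed_abort.finish("t1", Outcome.COMMITTED)  # must be recorded
presumed_abort.finish("t2", Outcome.ABORTED)    # nothing to record

presumed_commit = Coordinator(Outcome.COMMITTED)
presumed_commit.finish("t3", Outcome.COMMITTED)  # nothing to record
presumed_commit.finish("t4", Outcome.ABORTED)    # must be recorded
```

In this sketch the number of retained records tracks the less common outcome
only when the presumption matches the common case, which is the trade-off the
paragraph above describes.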

Another design option that we investigated is one phase vs. two phase
commitment. The one phase commit protocol implies that in response to every
update operation on an object, its Type Manager creates a new commit pending
version of the object on stable storage. The object remains in the commit
pending state until the Type Manager receives the decision about the
commitment or abortion of the client transaction. The commit pending state
implies that the object cannot be used by other clients until the commit
decision is received from the coordinator. A coordinator failure while an
object is in the commit pending state will cause the object to remain
unavailable to other clients. The period during which an object is in the
commit pending state is called its in-doubt period. The two phase commit
protocol tends to reduce this window of vulnerability. In this protocol, each
update operation creates an uncommitted version of the object. At the end of
a transaction, its coordinator executes a protocol which first attempts to
make every object accessed by that transaction commit pending. After this
phase, it makes the commit/abort decision. The two phase commit protocol
thus requires additional messages in exchange for a shorter in-doubt period.
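
The object version states described above can be summarized as a small state
machine. The Python sketch below is one possible reading of the text, with
hypothetical names throughout. In particular, the transition from uncommitted
directly to aborted is an assumption, since the text does not say how an
uncommitted version is discarded.

```python
from enum import Enum, auto


class VersionState(Enum):
    UNCOMMITTED = auto()
    COMMIT_PENDING = auto()
    COMMITTED = auto()
    ABORTED = auto()


# Allowed transitions. UNCOMMITTED -> ABORTED is an assumption of this sketch.
ALLOWED = {
    VersionState.UNCOMMITTED: {VersionState.COMMIT_PENDING, VersionState.ABORTED},
    VersionState.COMMIT_PENDING: {VersionState.COMMITTED, VersionState.ABORTED},
    VersionState.COMMITTED: set(),
    VersionState.ABORTED: set(),
}


def state_after_update(protocol):
    """State of the new object version right after an update operation."""
    if protocol == "one-phase":
        return VersionState.COMMIT_PENDING  # in doubt from the first update
    if protocol == "two-phase":
        return VersionState.UNCOMMITTED     # in doubt only after phase one
    raise ValueError(f"unknown protocol: {protocol}")


def advance(current, target):
    """Move a version to a new state, rejecting transitions not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def in_doubt(state):
    """True while the version awaits the coordinator's decision."""
    return state is VersionState.COMMIT_PENDING
```

With one phase commit the version is in doubt for the remainder of the
transaction; with two phase commit it becomes in doubt only when the
coordinator runs the first phase at the end of the transaction.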

Obviously, the one phase commit protocol is preferred if there are few
failures in the system; however, in an environment where the failure rates are
high, it is more desirable to use the two phase commit protocol. The two
phase commit protocol introduces overheads in terms of extra messages and disk
I/Os. These overheads may not be justifiable for short transactions. In our
example evaluations we investigate how to determine which option would be most
suitable for a given application.


7.3  Summary of the Evaluation Data

To do the evaluation we chose to hold the hardware architecture constant,
the software object manager to processor allocation constant, and the
workload generation constant. A set of five fault injection rates was
defined. They varied from no faults to 12.8 faults per 100 seconds. The
system structure was varied by using different commit protocols between
the generic object manager and process/transaction manager. Four
different commit protocols were modeled: one phase presumed abort, two
phase presumed abort, one phase presumed commit, and two phase presumed
commit. The model was run for each protocol at each failure rate until
1000 transactions had successfully committed. Some of the highlights of
the analysis are summarized here.

Figure 7-1 shows the effect of the commit protocol on the throughput of the
overall workload as the failure rate increases. The overall transaction
throughput summary demonstrates the performance degradation due to the
increasing occurrence of faults. There are three main points to note:
the effect of presumed abort versus presumed commit, of two phase versus
one phase, and of timeout periods. Presumed commit protocols outperform
presumed abort protocols for low fault rates, as expected. But for one
phase protocols, the presumed abort protocol outperforms the presumed
commit protocol when the fault rate exceeds 5 faults/100 seconds. This
indicates that it may be desirable to have an adaptive commit algorithm
that uses a presumed commit protocol when the environment is not faulty
and switches to a presumed abort protocol when the fault rate surpasses a
given threshold.
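
One way such an adaptive selection might look is sketched below in Python.
The report does not specify an algorithm, so everything here is an assumption:
the class name AdaptivePresumption, the sliding-window rate estimate, and the
window length are hypothetical, and the default threshold simply reuses the 5
faults per 100 seconds observed for the one phase protocols in this workload.
The sketch also ignores how in-flight transactions would be handled when the
presumption changes.

```python
from collections import deque


class AdaptivePresumption:
    """Chooses a presumption from the fault rate seen in a sliding window."""

    def __init__(self, threshold_per_100s=5.0, window_s=100.0):
        self.threshold_per_100s = threshold_per_100s
        self.window_s = window_s
        self.fault_times = deque()  # timestamps (seconds) of observed faults

    def record_fault(self, now_s):
        self.fault_times.append(now_s)

    def rate_per_100s(self, now_s):
        # Drop faults that have fallen out of the observation window.
        while self.fault_times and self.fault_times[0] <= now_s - self.window_s:
            self.fault_times.popleft()
        return len(self.fault_times) * 100.0 / self.window_s

    def protocol(self, now_s):
        if self.rate_per_100s(now_s) > self.threshold_per_100s:
            return "presumed abort"
        return "presumed commit"


selector = AdaptivePresumption()
initial_choice = selector.protocol(0.0)
for fault_time in (10.0, 11.0, 12.0, 13.0, 14.0, 15.0):
    selector.record_fault(fault_time)
choice_during_burst = selector.protocol(50.0)
choice_after_quiet_period = selector.protocol(200.0)
```

The selector starts with presumed commit, switches to presumed abort while six
faults fall inside the window, and returns to presumed commit once they have
aged out.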

One phase protocols outperform two phase protocols. This is not
surprising for environments with a low fault rate. It is somewhat
surprising for environments with a very high fault rate. This can be
explained by two observations. First, the slope of the curve of a two
phase protocol tends to decrease more rapidly than that of one phase
protocols, indicating that there may exist some fault rate at which two
phase protocols do indeed outperform one phase protocols. Second, the
model of the time duration of a device failure is unrealistically short.
This is due to the excessive time and resources that would be required to
run a simulation that accurately modeled a device's downtime. The effect
on a one phase commit protocol of a longer downtime is to increase the
size of the window of vulnerability (or in-doubt period) of a server to
the failure of a coordinator. This would increase the period during which
a set of objects would be indefinitely blocked, thereby reducing the
potential system concurrency and therefore throughput.

The timeout period for which an object manager holds a lock for a
transaction using a two phase commit protocol can unduly affect the
throughput. A short period may result in many transactions being aborted
unnecessarily. This is explained as follows. As the multiprogramming level
increases, the concurrency level of the object managers increases. When the
objects are reliably manipulated there are extra disk accesses to store
stable states. The increased number of disk accesses results in a
bottleneck at the stable storage device. This lengthens the period of time
that a transaction waits for a response from an object manager and that an
object manager holds a lock for a transaction. As the stable storage request
queue grows, the number of aborted transactions increases. Note that this
does not happen for one phase commit protocols because objects are placed
in a commit pending state following their first access, nor does it happen
for environments with a high fault rate because the multiprogramming level
is reduced. This problem can be avoided for a two phase protocol if
either the multiprogramming level is reduced or the timeout period is
increased. This explains why the throughput of the two phase presumed
abort run with no faults is as low as that of the few faults run.
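
The interaction between queue length and lock timeout can be illustrated with
a deliberately crude model. The Python sketch below assumes a single FIFO
stable storage queue with a fixed service time per request, and counts a
request as aborted when it completes after the lock timeout. The function
name and all numbers are hypothetical and are not taken from the Zeus model.

```python
def aborted_requests(queue_length, service_time_ms, lock_timeout_ms):
    """Count queued requests whose completion time exceeds the lock timeout.

    Assumes one FIFO stable storage device and a fixed service time, so the
    request in position k (counting from 1) completes at k * service_time_ms.
    """
    aborted = 0
    for position in range(1, queue_length + 1):
        completion_ms = position * service_time_ms
        if completion_ms > lock_timeout_ms:
            aborted += 1
    return aborted


# Hypothetical figures: 30 ms per stable storage write.
long_queue_short_timeout = aborted_requests(10, 30, 200)
short_queue_short_timeout = aborted_requests(5, 30, 200)
long_queue_long_timeout = aborted_requests(10, 30, 400)
```

In this toy setting, either shortening the queue (a lower multiprogramming
level) or lengthening the timeout removes the unnecessary aborts, which
mirrors the two remedies named above.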


Figures 7-2 through 7-5 show the effect of the commit protocol with
varying fault rates on the workload mix. They demonstrate that as the
fault rate increases, shorter transactions tend to dominate the mix of
successful transactions. Here "short" appears to be sensitive not to the
number of objects accessed, but to the number of operations on the set of
objects. Further, the overhead of a two phase protocol did not seem to be
warranted for short transactions under any fault rate.


In this chapter we have presented the goals of the example system
evaluation, the description of the example system, and the evaluation of
some commit protocols under various failure characteristics of the
operating environment. A detailed description of this evaluation is
presented in the guidebook.


Figure 7-1.  Effect of failure rate and commit protocol on throughput.

[Figure panels not reproduced. Surviving panel titles: "... Presumed Abort
on Transaction Mix"; "Effect of 2 Phase Presumed Commit on Transaction Mix".]

Figures 7-2 through 7-5.  Effect of failure rate and commit protocol on job
mix.


CHAPTER  8 


FUTURE  DIRECTIONS 


The  long  range  goal  of  research  on  distributed  system  recovery  mechanisms 
is  to  develop  a  methodology  that  allows  a  system  designer  to  select  an 
appropriate  set  of  recovery  mechanisms  for  a  given  system  environment  and 
workload.  The  work  done  for  this  contract  has  created  a  foundation  for  this 
goal.  This  report  has  presented  a  summary  of  that  work.  The  details  of  the 
work  are  reported  in  the  system  designers  guidebook.  There  are  a  number  of 
areas which warrant further research. In this chapter, we discuss four areas:
system structuring, analysis and validation techniques, design
specifications, and a designer's workbench.

8.1  System  Structuring 

System  structuring  topics  to  explore  include  the  following:  advanced 
studies  in  reliability  techniques,  reliable  process  oriented  systems,  the 
impact  of  failures  on  system  security,  implementation  issues  for  object 
oriented  systems,  and  distributed  programming  environments.  Advanced  studies 
into reliability techniques can be conducted in either an application specific
or  application  independent  manner.  An  application  specific  approach  examines 
the  requirements  of  a  specific  application  and  produces  results  that  are  very 
appropriate  for  a  specific  class  of  applications.  There  is  the  potential 
drawback  that  the  results  may  not  be  generalizable  to  other  classes  of 
applications.  The  goal  of  research  in  this  area  is  to  develop  techniques  that 
provide  increases  in  the  performance  and  reliability  of  recovery  mechanisms 
for  distributed  command  and  control  applications.  The  development  of 
non-serializable  transaction  processing  techniques  that  take  into  account  the 
semantics  of  command  and  control  operations  is  a  fruitful  area  for 
exploration.  This  research  may  be  pursued  either  through  detailed  simulation 
or  experimental  evaluation. 

Research  into  generic  reliability  techniques  has  the  goal  of  producing  a 
handbook  of  distributed  system  recovery  mechanisms.  The  handbook  would 
describe  the  performance  and  reliability  of  generic  recovery  mechanisms,  and 
identify  what  mechanisms  are  appropriate  for  what  kind  of  applications.  There 
are  two  avenues  of  investigation.  The  first  is  to  develop  new  algorithms, 
analyze  their  performance  and  reliability  attributes,  and  determine  for  what 
applications  they  are  useful.  The  development  of  a  general  theory  of 
non-serializable  transaction  processing  based  on  the  semantics  of  operations 
and/or  probabilistic  decision  making  is  an  example  of  this  kind  of 
exploration. Algorithms and strategies to dynamically partition, assign, and
reconfigure objects so as to increase performance and reliability are
another example. The second is to do a detailed study of existing
mechanisms. Such a study could compare, for the management of recoverable
objects, different techniques such as differential files, logs, and careful
replacement; for reliable transactions, the effect of concurrency, deadlock,
timeouts, and failures in conjunction with commit protocols; and for
replication management, the effect of different levels of replication and
network partitions on consistency and recovery techniques. These explorations
may be done either through the use of detailed simulation or experimental
evaluation.

The  work  done  on  this  contract  has  focused  on  object  oriented  system 
designs.  Many  of  the  existing  systems  have  been  developed  with  a  process 
oriented  structure.  There  are  two  possible  areas  of  exploration.  The  first 
considers  the  application  of  object  oriented  design  and  recovery  to  real  time 
systems,  a  set  of  applications  that  have  been  traditionally  developed  using 
process  oriented  techniques.  The  second  considers  existing  and  new  recovery 
techniques  for  reliable  process  oriented  systems.  The  techniques  can  be 
explored  through  detailed  simulation  or  experimental  evaluation. 

Security  is  one  area  of  operating  system  functions  that  this  work  did  not 
explore.  Existing  security  policies  are  based  on  centralized  management 
techniques;  it  is  assumed  that  a  system  is  either  running  or  stopped.  But  in 
a  distributed  system  it  is  possible  for  some  system  components  to  fail  and  for 
the  rest  of  the  system  to  continue  operating.  The  question  arises  as  to  what 
is  the  impact  of  failures  on  security. 

Zeus,  an  object  oriented  design  of  a  distributed  system,  was  developed 
and  used  for  modeling  in  this  contract.  It  was  demonstrated  that  object 
orientation  provides  a  number  of  advantages  for  recovery.  There  are  a  number 
of  unanswered  questions  about  how  a  reliable  object-oriented  distributed 
system should be implemented. There are two kinds of problems associated with
a development effort for building a Zeus-like system. The first kind is
related to certain generic problems in implementing objects. Examples
include how to efficiently implement functions such as transparency of
location, replication, failures, and concurrency. The second kind is related
to the software development environment, such as the selection of an
appropriate operating system kernel, programming language(s), and tools such
as compilers, linkers, loaders, and debuggers. The software
selection  is  a  difficult  task  because  of  the  small  number  of  languages  and 
tools  that  exist  for  distributed  environments.  This  is  a  critical  problem 
because  the  failure  to  obtain  the  proper  tools  may  require  additional  effort 
in  building  such  tools.  Building  a  Zeus-like  system  on  some  commercially 
available  workstation  along  with  its  operating  system  such  as  UNIX(1)  may 
require  tailoring  of  the  kernel  functions  to  facilitate  efficient 
implementation  of  recovery  mechanisms.  Such  modifications  to  the  host 
operating  system  are  engineering  research  problems  which  need  to  be 
investigated. 


(1)  UNIX is a registered trademark of Bell Laboratories


Distributed programs are difficult to develop. One approach to easing
their development is to use an object oriented approach combined with a run
time environment that provides transparency as described above. A
complementary approach is to integrate high level non-procedural constructs
into an object oriented environment. This further eases the difficulty and
time  required  for  developing  distributed  programs,  resulting  in  an  increase  in 
programmer  productivity.  There  are  several  research  topics  associated  with 
such  a  distributed  programming  environment,  including:  non-procedural 
language  constructs,  translation  of  non-procedural  constructs  that  may  include 
conditionals  into  sequences  of  operations  on  objects,  and  the  partitioning  and 
assignment  of  objects  in  a  distributed  environment. 

8.2  Analysis  and  Validation 

The  design  evaluation  methods  cover  system  attributes  such  as 
reliability,  performance,  and  functional  correctness.  There  are  several 
directions  for  future  research  and  development  in  the  area  of  design 
evaluation  methods. 

In  the  area  of  performance  modeling,  there  is  a  need  to  investigate 
analytical  models,  possibly  based  on  Markov  chains,  of  distributed  system 
recovery  mechanisms  and  to  develop  analytical  techniques  to  predict  the 
performance  of  reliable  and  survivable  distributed  systems  using  these  models. 
Modeling  of  checkpointing,  rollback,  commit  protocols,  and  replication 
management protocols should be included in this effort. An interesting area
of investigation could be the development of analytical models of protocols
for replication management under weak consistency requirements. The
development of analytical models that facilitate both performance and
reliability evaluations and their interactions in a fault tolerant system is
an important
research  area.  The  performance  evaluation  of  the  example  system  in  this 
effort  is  based  on  simulations  using  PAWS.  Our  experience  in  this  effort 
indicates  that  it  is  desirable  to  have  an  advanced  simulation  language  that 
provides  convenient  mechanisms  for  modeling  faults  and  their  effects  in  a 
distributed  system.  For  example,  a  language  construct  that  stops  progress  of 
all  computations  associated  with  a  failed  component  would  be  useful. 

In  this  effort,  the  work  related  to  the  application  of  program 
verification  techniques  focused  on  the  construction  of  recoverable  objects  at 
a single site. The verification of protocols for constructing distributed
recoverable  objects  using  program  verification  techniques  such  as  Gypsy  is  not 
completely  addressed  in  this  effort.  The  problem  of  protocol  verification 
needs  a  significant  level  of  additional  work.  In  this  contract  we  propose  an 
approach  using  Finite  State  Machines  and  interval  logic  to  reason  about  the 
correctness  of  such  protocols.  In  the  system  designers  guidebook  this  method 
is  developed  and  illustrated  using  an  example.  This  method  appears  promising 
because  it  is  simpler  than  program  verification  techniques.  Efforts  are 
needed  to  develop  a  formal  theory  for  the  verification  of  distributed  system 
recovery  protocols  using  this  method.  It  should  then  be  possible  to  build 
some automated tools for applying this method.

8.3  Formal Methods for Design Definition


One of the important features desired in a design definition language for
reliable systems is an exception condition handling model. Concurrent System
Definition Language, which does not have this feature, can be extended to
include an exception handling model. Formal specification of functional
requirements along with their performance and reliability characteristics is
important during the various design phases. It would be interesting to
investigate such a specification language in the context of Ada or CSDL.

8.4  System  Designers  Workbench 

The  system  designers  guidebook  presents  a  set  of  techniques  and  tools  for 
evaluating  reliable  distributed  system  designs.  These  tools  include  PAWS  for 
performance  evaluation,  NetRAT  for  reliability  evaluation,  Gypsy  for  formal 
verification, and Path Pascal for functional simulations. One can envision a
designers workbench that integrates these tools and facilitates their
convenient application to a distributed system design expressed in some
formal design language. This workbench would automatically translate a
design expressed in the design language to the
appropriate  evaluation  model  required  for  an  evaluation  tool.  It  would  also 
guide  the  designer  during  the  analysis  procedure  and  ask  questions  regarding 
any  information  that  is  necessary  for  evaluations  but  not  specified  in  the 
design. 

8.5  Recommendations 

Distributed processing research is in a state of flux. There is an
abundance of concepts about how to develop systems and what functions
systems should contain. However, there is a shortage of experience in applying
these concepts in the actual development of systems. The insight that one can
gain from the experimental evaluation of a system differs dramatically from
what one can determine from modeling and analysis. The data and subsequent
insight needed to make significant progress can be obtained only through the
actual observation of a phenomenon. Therefore, it is strongly
recommended  that  work  be  continued  on  the  general  topic  of  system  structuring 
using  experimental  evaluation.  To  better  ensure  the  relevance  of  future 
results,  we  recommend  an  approach  with  two  thrusts  that  may  require  the 
participation  of  multiple  organizations.  One  thrust  examines  command  and 
control  applications  in  detail,  resulting  in  a  detailed  design  of  a 
demonstrable  subset  of  a  command  and  control  application.  The  other  thrust 
pursues  the  experimental  evaluation  of  generic  reliability  techniques  using 
the above application as a test vehicle. The implementation should provide an
object  based  distributed  operating  system,  the  use  of  non-procedural  language 
constructs,  and  tools  that  aid  in  the  partitioning  and  assignment  of  objects 
in  a  distributed  environment.  The  experimental  evaluation  should  provide  data 
as  to  how  the  recovery  mechanisms,  object-oriented  operating  systems,  and 
non-procedural  language  constructs  support  distributed  command  and  control 
applications.

CSDL:  CONCURRENT SYSTEM DEFINITION LANGUAGE


1.1  INTRODUCTION

Concurrent  system  design  is  an  engineering  activity  which  requires  software 
engineering  technology  comprising  a  design  methodology,  design  methods  and 
languages,  and  tools  to  automate  its  procedures.  This  chapter  presents 
methodological  principles  and  linguistic  support  for  engineering 
constructively  correct  concurrent  system  designs.  The  Concurrent  System 
Definition  Language  (CSDL)  is  based  on  a  formal  model  of  computation,  giving 
it  both  rigorous  foundations  to  support  the  less  formal  creative  process,  and 
mechanical  means,  such  as  mapping  functions,  of  supporting  system  engineering. 
CSDL  integrates  software  development  techniques  which  have  not  been  combined 
before  (data  abstraction,  information  hiding,  temporal  logic,  Dijkstra's 
guarded  commands)  allowing  designers  to  create  and  reason  about  data, 
algorithms,  and  communication  architecture.  CSDL  contains  both  a  description 
and  a  specification  language,  permitting  designers  to  carry  out  the  entire 
design  process  in  the  same  syntactic  and  semantic  environment. 

CSDL  is  a  collection  of  seasoned  techniques,  mechanisms  and  language 
constructs  that  have  not  been  combined  before.  It  is  rewarding  to  discover 
that  many  significant  individual  contributions  to  software  design  can  be 
integrated  into  a  single  system  without  major  clashes,  and  that  they  do 
function  in  an  integrated  way  to  provide  the  desired  reasoning  vehicle  for 
constructing  verifiably  correct  designs.  It  is  also  rewarding  to  discover 
that  even  when  CSDL  is  used  informally,  it  succeeds  in  helping  designers 
produce  more  robust  designs  and  have  more  control  over  the  design  process. 

CSDL  succeeds  in  managing  design  complexity  by  encouraging,  almost  forcing,  an 
architectural  view  of  systems.  This  architectural  view  is  compatible  with 
both top-down and bottom-up design styles; in each case a collection of system
elements  is  "hooked  together"  to  construct  a  system  that  satisfies  a 
specification.  CSDL  also  succeeds  as  a  means  of  producing  implementation 
blueprints  because  it  has  constructs  for  expressing  data  structure,  procedural 
behavior  and  communication  architecture. 

A  major  part  of  detailed  design  of  the  Zeus  operating  system  was  done  using 
CSDL. These designs are presented in Volume II of this guidebook.

This  chapter  presents  CSDL's  computational  model  and  model  of  system 
architecture,  its  methodology  and  language  constructs,  an  example  using  it  and 
some possible directions for its enhancement.


Some requirements we identified for a software engineering package are based
on our recognition of software engineering as an instance of engineering in
general, and our understanding of the engineering problem. They are:

1.  A  well-defined  theoretical  model  of  computing  to  support  a  rational, 
creative  software  design  and  the  development  process.  The  model 
constitutes  scientific  underpinnings  that  define  the  way  engineers  think 
about  both  the  software  product  and  the  methods  and  components  used  to 
build  it. 

2.  Technical  methods  for  use  by  individual  developers  in  engineering  a 
partial  or  complete  software  system  design.  These  are  methods,  such  as 
data  abstraction,  which  the  individual  practitioner  applies  to  the  design 
task,  not  methods  such  as  version  control  or  configuration  control  that 
are  applied  to  managing  whole  projects. 

3.  Support  for  creative  freedom  and  realistic  analysis  of  correctness, 
feasibility,  and  economy  of  alternatives.  The  models,  methods  and 
languages  the  project  develops  must  accommodate  both  the  creative  ideas 
that  experienced  designers  get,  and  the  analytic  methods  they  use  to 
determine  the  effects  of  their  ideas. 

4.  Techniques  to  manage  and  reduce  the  complexity  of  the  design  process,  the 
resulting  design  and  the  design  document.  The  technical  methods  must 
include  facilities  for  decomposing  the  system  design  task  so  that  it 
becomes  intellectually  manageable. 

5.  A  design  language  that  presents  a  software  system  design  in  the  form  of  a 
description  explaining  how  the  system  is  built  up  from  smaller  pieces,  and 
how  those  pieces  are  connected,  along  with  a  specification  that  explicates 
the  expected  observable  behavior  of  the  system,  its  parts,  and  the 
mechanisms  that  connect  them.  A  system's  description  constitutes  the 
blueprint  of  components  and  interconnections  from  which  it  is  built;  its 
specification  presents  the  relevant  properties  of  the  device  as  a  whole, 
of  its  parts,  and  of  the  interconnection  mechanisms. 

6. A design notation that expresses the product's specifications and
description in the terms of the implementing technology. A detailed
design  is  solution  oriented,  so  it  is  expressed  using  statements  about  the 
processes,  input  and  output  values,  algorithms  and  memory  of  software 
technology,  not  the  user  interfaces,  applications  packages,  sensors, 
actuators,  or  transducers  of  user  requirements  technology. 

Other requirements for the CSDL software engineering package arise from the
fact that it is meant to be used by people. They include:

o  The models and languages must have intuitive appeal to software designers
and programmers.

o  The methods must be automatable, so that machines, not humans, can deal with
detail.

These requirements led to the development or adoption of:

1.  A  formal  model  of  sequential  and  concurrent  computations. 

2.  A  system  model  that  characterizes  the  building  blocks  with  which  systems 
may  be  designed. 

3.  Methodological  principles  and  guidelines  that  define  desirable  properties 
of  the  design  activity,  the  design  language  and  the  design  itself,  and 
make  procedural  suggestions  for  carrying  out  the  design  process. 

4. Technical methods essential for engineering software. They are, for
example, data abstraction, procedural abstraction, Dijkstra's constructive
approach, and the like.

5. A description language - a formal notation for describing how a system is
built up from pieces and how those pieces are connected. Its semantics
are based on the model of computation.

6. A specification language - a formal notation for documenting the expected
behavior of a system description. Its semantics are based on the formal
model.

7. Analytic methods for investigating operational properties such as
performance, reliability, or security of alternative functionally correct
system designs.


These elements are applied to detailed design, the development phase whose work
product  is  a  design  documenting  a  system's  logical  architecture,  its  paths  of 
information  flow,  the  data  type  of  each  system  object  and  the  behavior  of  the 
system  and  each  of  its  modules.  A  detailed  design  expresses  what  will 
actually  be  implemented.  Each  object  in  the  design  —  module,  data  object, 
procedure,  or  information  flow  path  —  will  exist  in  the  implementation, 
though  the  object's  physical  realization  may  be  different  from  its  logical 
design.  For  example,  a  type  operation  designed  as  a  procedure  may  be 
implemented  with  in-line  code. 

The remainder of this chapter is organized as follows: Section 1.2 presents
CSDL's underlying formal models of computation and system architecture.
Section 1.3 explains the CSDL design methodology and presents the major
linguistic features that support it. Section 1.4 presents the CSDL constructs
for describing and specifying system designs. Section 1.5 is an example
system design. Section 1.6 draws conclusions, outlines possible
improvements and suggests directions for future work.


1.2  MODELS 

Both CSDL's computational model and its system model define ways that people
can think about the software they build. The computational model is a
formalized, mathematical abstraction of the phenomena that an engineer
manipulates when building an artifact. It is the vehicle for theoretical
work.  The  system  model  is  qualitatively  different.  It  characterizes  the 
abstract  building  blocks  of  a  concurrent  system  design.  Thus,  the  system 
model  introduces  conceptual  constraints  on  the  model  of  computation,  implying 
that  CSDL's  computations  will  not  be  realized  in  all  the  ways  that  the 
computational  model  allows,  but  only  in  those  ways  which  can  occur  on  this 
conceptual  architecture.  CSDL's  computational  model  was  used  in  developing 
the  language  and  in  presenting  it  in  [FRAN83a]  and  [FRAN83b].  In  this 
section, the language is explained more informally, so the reader may choose to
skip Section 1.2.1. However, the system model (Section 1.2.2) is used in the
remainder  of  this  chapter. 


1.2.1  Computational  Models 

CSDL's  computational  model  formalizes  the  concepts  needed  to  talk  about  the 
structure  and  observable  effects  of  a  large  class  of  programming  mechanisms. 
Our  goal  is  a  precise  and  rigorous  model  that  expresses  the  essentials  of  the 
things  system  engineers  work  with  simply  and  intuitively.  Precision  and  rigor 
are  necessary  if  the  results  of  reasoning  in  the  model  are  to  be  trusted. 
Simplicity  and  intuitiveness  are  necessary  if  the  model  is  to  be  adopted  by 
people  whose  task  is  to  build  things,  not  to  philosophize  about  them. 

Section 1.2.1.1 presents a model of sequential computations; Section 1.2.1.2
presents a model of concurrent computations and Section 1.2.1.3 explains
system histories which are used to reason about system behavior.


1.2.1.1 Sequential Computations

Our  model  of  sequential  computations  is  a  very  conventional  one  based  on  the 
primitive  notions  of  states  and  transitions.  This  model  appears  to  be 
sufficient  to  define  program  semantics  in  terms  of  effects  on  data  and 
parameters. 

A  data  object  x  is  an  entity  that  can  take  on  any  value  V(x)  of  a  specified 
set  of  values  T(x).  The  state  of  x  at  some  point  during  its  lifetime  is  its 
value  V(x)  at  that  point.  Given  a  set  X  of  n  objects  x1,...,xn,  where  each  xi 
is  of  type  Ti,  the  state  q(X)  at  some  point  in  the  lifetime  of  X  is  given  by 
the  vector  of  values  of  the  objects 

q(X) = < V(x1),...,V(xn) >

at that point. X is called an object space, and the set Q(X) of all such
vectors  is  called  the  state  space  of  X. 
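
The definitions above can be made concrete with a small sketch. The two objects, their names and their value sets are assumptions chosen only for illustration; they do not come from the report. The sketch builds one state q(X) as a vector of values and enumerates the state space Q(X) as the product of the value sets.

```python
from itertools import product

# Hypothetical object space X with two objects; names and value sets are
# illustrative assumptions, not taken from the report.
T = {
    "flag": (False, True),   # T(flag): values that flag may take
    "count": (0, 1, 2),      # T(count): values that count may take
}

def state(values):
    """Return q(X): the vector of current values, in a fixed object order."""
    return tuple(values[name] for name in T)

def state_space():
    """Return Q(X): every vector formed by picking one value per object."""
    return set(product(*(T[name] for name in T)))

q = state({"flag": True, "count": 2})
Q = state_space()
```

Under these assumptions Q(X) has 2 x 3 = 6 elements, and q is one of them.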

A single terminating sequential program P defined over an object space X
effects a state transition on X in that it is invoked with X in one state q(X)
and terminates with X in some state q'(X). The effects of P may be expressed
by a binary relation [P] over Q(X). The interpretation of this relation is
that the pair < q(X), q'(X) > is an element of [P] if and only if P is
guaranteed to halt when invoked from q(X), and q'(X) is one of the states in
which P can halt when invoked from q(X). With this interpretation it follows
that the domain of [P], that is, the set of all states that can be first
elements of pairs in [P], is exactly the set of initial states from which
termination is guaranteed.

An  execution  of  P  over  X  may  be  modeled  by  a  sequence  of  states  h(X,P): 

h(X,P) = q0(X); ... qn(X); ...

where q0(X) is the state of X at the invocation of P. If q0(X) is an element
of the domain of [P] then h(X,P) is finite, and if in addition qn(X) is the
last state of h(X,P) then < q0(X), qn(X) > is an element of [P].

Different  designs  of  a  program  that  has  a  given  desired  effect  may  be 
distinguished  by  their  possible  execution  sequences,  which  reflect  the 
intermediate  states  a  particular  design  will  pass  through  while  attaining  the 
desired  over-all  effect. 
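
As a rough illustration of the relation [P], its domain, and an execution sequence h(X,P), the sketch below uses a hypothetical one-object program, "while x != 0: x := x - 1". The program, the range of states and the step limit that stands in for "does not terminate" are all assumptions made for the example.

```python
STATES = range(-2, 4)   # assumed Q(X) for a single integer object x

def step(x):
    """One transition of the assumed program P. None means P has halted."""
    return None if x == 0 else x - 1

def execution(x0, limit=100):
    """Return h(X,P) from x0 as a list, or None if P has not halted
    within `limit` steps (treated as non-termination in this sketch)."""
    history = [x0]
    for _ in range(limit):
        nxt = step(history[-1])
        if nxt is None:
            return history
        history.append(nxt)
    return None

# [P]: pairs <q, q'> such that P halts from q, ending in q'.
relation = set()
for x0 in STATES:
    history = execution(x0)
    if history is not None:
        relation.add((history[0], history[-1]))

# Domain of [P]: the initial states from which termination is guaranteed.
domain = {q for q, _ in relation}
```

Here the domain is the set of non-negative initial states, and execution(3) passes through the intermediate states 2 and 1 before reaching 0. A different design with the same relation [P] could pass through different intermediate states.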


1.2.1.2 Concurrent Computations

The  sequential  model  does  not  deal  with  the  notion  of  time  or  of  an  external 
environment  with  respect  to  which  a  program  causes  change  through  time.  These 
phenomena  are  treated  by  CSDL's  model  of  concurrent  computation. 

A  concurrent  computation  is  modeled  as  a  collection  of  m  sets  of  object  spaces 
X1,...,Xm, with a program Pi defined over each Xi. The object spaces Xi may
overlap or have elements in common. This occurs in a CSDL design when two
programs share communication objects. Objects may move from space to space;
that is, an object may "instantaneously" vanish from space Xi and appear in
space  Xj.  This  occurs  in  a  CSDL  design  when  one  program  dynamically  creates  a 
process  and  gives  some  of  its  objects  to  the  newly  created  process.  Attempts 
by  multiple  programs  to  operate  on  a  shared  object  are  nondeterministically 
serialized.  The  transfer  of  an  object  from  one  space  to  another  is  serialized 
with  all  other  operations.  With  the  exception  of  serialization  of  operations 
on  shared  objects,  the  programs  Pi  over  the  object  spaces  Xi  proceed 
asynchronously  and  independently  of  one  another. 

Object  spaces  may  vanish,  and  new  ones  appear  throughout  the  lifetime  of  a 
concurrent  computation.  This  is  expressed  in  a  CSDL  design  by  dynamic  process 
creation  and  destruction.  The  computation  exists  as  long  as  at  least  one 
space  exists.  When  a  new  space  appears,  its  objects  may  all  be  new  or,  as 
stated  above,  some  of  them  may  come  from  an  existing  space.  When  a  space 
containing  such  "borrowed"  objects  subsequently  vanishes,  the  borrowed  objects 
are  returned  if  the  "lender"  still  exists.  Otherwise  they  vanish.  That  is, 
if  a  program  dynamically  creates  a  process,  gives  control  of  some  of  its 
objects to the new process and then destroys the process, the objects it
yielded  control  over  return  to  its  control. 

This  model  of  concurrent  computations  introduces  complications  into  the  model 
of  sequential  computations.  A  single  space  X  with  program  P  can  have  objects 
disappear  and  appear  throughout  its  lifetime  as  it  creates  and  destroys 
processes.  Also,  objects  that  are  shared  with  other  spaces  can  appear  locally 
to change state asynchronously with, and without being manipulated by, P. That
is,  information  enters  from  P's  environment. 


1.2.1.3 Histories

An execution sequence h(Xi,Pi) of a program Pi over object space Xi is a
particular (possible) history of Xi. The set H = { h(X1,P1),...,h(Xn,Pn) } is
a  possible  history  of  the  concurrent  computation.  H  may  have  some  elements 
that  appear  and  vanish  and  others  that  are  infinite;  this  corresponds  to  some 
object  spaces  appearing  and  vanishing  while  other  object  spaces  last  forever 
once  they  are  created.  The  only  restrictions  on  state  and  rate  of  the  various 
members  of  H  are  those  that  arise  from  shared  and  moving  objects.  In 
particular,  although  each  history  h(Xi,Pi)  is  totally  ordered,  there  is  only  a 
strict  partial  order,  precedes,  among  states  in  H.  The  precedes  relation  is 
the  basis  for  the  notion  of  temporal  order.  The  precedes  relation  may  be 
defined  recursively  as  follows: 

(1)  Within  a  history  h(Xi,Pi)  the  state  qj(Xi)  precedes  qk(Xi)  if  and  only 
if  j  <  k. 

(2) If x is shared by spaces Xi and Xj, if qm(Xi) and qm+1(Xi) are
consecutive states in h(Xi,Pi) corresponding to a change in x, and if
qk(Xj) and qk+1(Xj) are consecutive states in h(Xj,Pj) corresponding to
the same change in x, then qm(Xi) precedes qk+1(Xj).

(3) If qp(Xi), qr(Xj), and qh(Xk) are states anywhere in the system, and if
qp(Xi) precedes qr(Xj) and qr(Xj) precedes qh(Xk), then qp(Xi) precedes
qh(Xk).

The  precedes  relation  is  partial  rather  than  total  because  there  can  exist 
distinct  states  qk(Xi)  and  qm(Xj)  neither  of  which  precedes  the  other.  For 
example, suppose Xi and Xj share x, and further suppose that Pi changes x from
2 to 3 and later Pj changes x from 3 to 4. All states of Xi up to the change
from 2 to 3 precede all states of Xj after that change, and all states of Xj
up to the change from 3 to 4 precede all states of Xi after the change. But
the states of Xi after the change from 2 to 3 but before the change from 3 to
4 have no defined relation to the states of Xj in the same period; they are
incomparable.
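
The example can be checked mechanically. The sketch below encodes each history as three states (x = 2, 3, 4 at indices 0, 1, 2), seeds the relation with rules (1) and (2), and closes it under rule (3). The encoding, the names, and the reading of rule (2) as applying with the roles of Xi and Xj exchanged as well are assumptions of this sketch.

```python
from itertools import product

# States are (space, index) pairs; x has the values 2, 3, 4 at indices 0, 1, 2.
hist_i = [("Xi", k) for k in range(3)]
hist_j = [("Xj", k) for k in range(3)]

edges = set()
# Rule (1): states within one history are ordered by index.
for hist in (hist_i, hist_j):
    edges |= {(hist[a], hist[b]) for a in range(3) for b in range(a + 1, 3)}
# Rule (2): for each shared change m -> m+1, the state before the change in
# one space precedes the state after it in the other (both directions assumed).
for m in range(2):
    edges.add((hist_i[m], hist_j[m + 1]))
    edges.add((hist_j[m], hist_i[m + 1]))

def transitive_closure(pairs):
    """Rule (3): add <a, d> whenever <a, b> and <b, d> are already present."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if new <= closure:
            return closure
        closure |= new

precedes = transitive_closure(edges)

def comparable(s, t):
    """True if one of the two states precedes the other."""
    return (s, t) in precedes or (t, s) in precedes
```

With this encoding the initial state of Xi precedes the final state of Xj, while the two middle states, hist_i[1] and hist_j[1], are incomparable, as the text describes.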


1.2.2 System Model

Our  system  model  defines  a  conceptual  architecture  for  concurrent  systems.  It 
limits  the  universe  of  designs  for  realizing  a  specified  computation  to  those 
that  can  be  defined  within  such  an  architecture. 

A  concurrent  system  is  a  collection  of  machines  which  operate  concurrently  and 
autonomously.  They  communicate  asynchronously  by  passing  information. 
Internally,  a  machine  consists  of  data  objects  and  procedures  and/or 
subordinate  machines  to  manipulate  these  objects.  A  machine  containing  only 
procedures  constitutes  a  sequential  locus  of  control.  A  machine  containing 
subordinate  machines  constitutes  several  autonomous  control  sites.  If  the 
system's  architecture  is  viewed  as  a  tree,  its  leaves  are  all  sequential 
control  sites. 

Machines  may  also  contain  machine  pools  from  which  machine  instances  may  be 
created  and  destroyed  as  the  system  runs. 

Systems  are  evolutionary.  The  initial  system  configuration  is  described  by  a 
distinguished  machine,  SYSTEM.  SYSTEM  may  contain  other  machines  and  machine 
pools.  Each  machine  that  SYSTEM  contains  may,  in  turn,  contain  other  machines 
and  machine  pools.  The  initial  system  is,  then,  the  configuration  consisting 
of  SYSTEM,  all  machines  it  contains,  and  all  the  machines  they  contain.  A 
system  evolves  by  dynamic  creation  and  destruction  of  machines  from  pools. 
Since  every  pool  element  may  contain  machines  and  machine  pools,  creating  a 
new  machine  dynamically  may,  in  effect,  create  a  new  subsystem. 
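
A minimal sketch of this evolutionary structure follows. The class, its method names, and the representation of a pool as a factory function are hypothetical; CSDL's own notation for machines and pools is not shown in this section. The sketch only illustrates that creating a pool element can add a whole subsystem and that the leaves of the tree are the sequential control sites.

```python
class Machine:
    """A machine with fixed submachines and pools of creatable instances."""

    def __init__(self, name, submachines=(), pools=None):
        self.name = name
        self.submachines = list(submachines)
        self.pools = dict(pools or {})   # pool name -> factory (assumed form)
        self.instances = []              # machines created dynamically

    def create(self, pool, name):
        """Create a new machine instance from the named pool."""
        machine = self.pools[pool](name)
        self.instances.append(machine)
        return machine

    def destroy(self, machine):
        """Remove a dynamically created machine (and its subsystem)."""
        self.instances.remove(machine)

    def leaves(self):
        """Names of the sequential control sites (machines with no children)."""
        children = self.submachines + self.instances
        if not children:
            return [self.name]
        return [leaf for child in children for leaf in child.leaves()]

def worker(name):
    # A pool element that itself contains a submachine: a small subsystem.
    return Machine(name, submachines=[Machine(name + ".log")])

system = Machine("SYSTEM", submachines=[Machine("clock")],
                 pools={"workers": worker})
initial = system.leaves()
w1 = system.create("workers", "w1")
grown = system.leaves()
system.destroy(w1)
```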

A  system's  communication  architecture  is  the  set  of  connections  among  its 
machines.  Connections  are  formed  among  active  objects,  objects  whose  values 
can  change  without  being  manipulated  by  the  machine  which  contains  them. 
Since  machines  cannot  manipulate  each  other's  objects,  a  communication  link  is 
set  up  by  connecting  an  active  object  in  one  machine  to  a  complementary 
(roughly  same  type,  opposite  direction)  active  object  in  another.  The  sending 
machine  puts  a  value  in  its  local  active  object,  and  that  value  is 
instantaneously  transmitted  to  the  complementary  active  object  from  which  the 
other  machine  can  get  it  by  a  local  operation.  Active  objects  may  be 
connected  to  realize  point-to-point,  multi-cast,  fan-in  and  broadcast 
communication  architectures.  Connected  active  objects  by  definition 
correspond  to  shared  objects  in  the  computational  model,  as  described  in 
Section 1.2.1.2.
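
The connection scheme can be pictured with the following sketch. The class names InPort and OutPort, their methods, and the use of a queue to hold delivered values are assumptions for illustration only; the report does not define buffering behavior at this point. The sender acts only on its own active object, and each receiver obtains the value by a local operation on its own.

```python
from collections import deque

class InPort:
    """Receiving active object: its contents change without its owner acting."""

    def __init__(self):
        self._queue = deque()

    def _deliver(self, value):   # called only by a connected OutPort
        self._queue.append(value)

    def ready(self):
        return bool(self._queue)

    def get(self):               # local operation of the receiving machine
        return self._queue.popleft()

class OutPort:
    """Sending active object; may be connected to one or many InPorts."""

    def __init__(self):
        self._targets = []

    def connect(self, in_port):
        self._targets.append(in_port)

    def put(self, value):        # local operation of the sending machine
        for target in self._targets:
            target._deliver(value)

sender = OutPort()
a, b = InPort(), InPort()
sender.connect(a)
sender.connect(b)                # two targets: a multi-cast connection
sender.put("hello")
```

Connecting one OutPort to one InPort gives point-to-point communication; connecting several OutPorts to the same InPort would give fan-in.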


1.3  METHODOLOGY 

CSDL's  methodology  comprises  design  guidelines  which  are  a  synthesis  of 
concepts  drawn  from  research  on  software  design.  They  offer  procedural 
suggestions  for  carrying  out  the  design  activity.  One  practical  effect  of 
adopting  design  guidelines  was  to  include  features  to  support  them  in  CSDL. 


1.3.1 Constructive Correctness

Basically,  CSDL  advocates  building  in  correctness;  that  is,  using  a  system 
component's  specification  as  a  driver  for  constructing  its  description.  The 
constructive  design  process  at  every  level  begins  with  a  requirements 
analysis.  At  the  topmost  level  these  specifications  of  requirements  are 
usually  arrived  at  through  discussions  among  designers,  requirements  analysts, 
and  customers.  The  task  of  designing  procedures,  data,  and  machines  then  uses 
these  top  level  specifications  as  its  starting  point. 

A  procedure's  functional  specifications  are  obtained  by  constructing 
assertions  that  define  a  constraint  on  valid  inputs  and  the  relation  between 
valid  input  and  desired  output.  These  assertions  are  defined  over  the 
variables  global  to,  and  the  parameters  of,  the  procedure.  A  procedure's 
behavioral  specifications  are  obtained  by  constructing  assertions  that 
characterize  state  sequences  over  the  global  variables  and  parameters 
associated  with  invocations  of  the  procedure.  These  assertions  express 
properties  both  of  individual  invocations,  such  as  time  performance,  and  of 
sets  of  invocations,  such  as  mutual  exclusion  and  ordering  constraints. 
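
A functional specification of this kind can be approximated with run-time assertions. The procedure below is a hypothetical example, not one from the report: its input constraint and its input/output relation are written as assertions over the parameter and the result, and the body is then written to satisfy them.

```python
def integer_sqrt(n):
    """Return the largest integer r such that r*r <= n."""
    # Input constraint (precondition): an assertion over the parameter.
    assert isinstance(n, int) and n >= 0, "precondition violated"
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    # Input/output relation (postcondition): relates valid input to output.
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r
```

Behavioral properties such as time performance or mutual exclusion concern state sequences across invocations and are not captured by assertions of this simple form.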

An  abstract  data  type's  functional  specifications  are  obtained  by  presenting  a 
model  of  the  type,  expressed  as  a  set  of  conceptual  data  objects,  and 
constructing  assertions  that  define  the  input  constraint  and  input/output 
relations  for  each  operation  defined  for  the  type.  An  abstract  data  type's 
behavioral  specifications  are  obtained  by  constructing  assertions  that 
characterize  state  sequences  over  the  model  that  are  associated  with 
invocations  of  the  type's  operations. 

A  machine's  functional  specifications  are  obtained  by  constructing  assertions 
that  characterize  the  relationship  between  its  output  sequences  and  its  input 
and  visible  state  sequences.  A  machine's  behavioral  specifications  are 
obtained  by  constructing  assertions  that  characterize  input,  state,  and  output 
sequences  that  satisfy  temporal  order  constraints  like  mutual  exclusion  and 
liveness  and  safety  properties,  and  temporal  metric  properties  like  time 
performance  and  time-out. 

The  results  of  a  design  step  are  procedure,  data  type  and  machine  designs,  and 
requirements  on  lower  level  mechanisms  which  together  imply  that  the  design 
meets  its  specifications.  The  design  process  then  returns  to  requirements 
analysis  and  design  of  the  lower  level  mechanisms.  Ideally,  design  proceeds 
in  a  net  top-down  fashion.  By  net  we  mean  that  a  design  step  will  be  finished 
before  the  designs  of  the  lower  level  mechanisms  it  uses  are  finished.  This 
allows for controlled depth-first exploration and backup when infeasibilities
are  discovered.  A  new  level  is  reached  when  the  procedures  and  types  which 
realize  primitives  of  the  upper  level  are  to  be  designed  and  proved  correct 
with  respect  to  their  requirements. 

Machine,  procedure,  and  data  type  designs  are  arrived  at  by  procedural 
refinement  (introducing  subprocedures),  type  refinement  (designing  an  abstract 
type  model's  implementing  structure  and  its  operations'  algorithms),  and 
object  space  partitioning  (partitioning  a  machine's  permanent  objects  into 
disjoint  subsets,  each  of  which  is  manipulated  by  a  submachine). 

For example, we recommend these four steps, in this order, for designing an
abstract data type:

o  Prepare a data type definition that specifies the type's externally visible
behavior and externally accessible operations.

o  Design an implementation structure; express the mapping between the
type definition's model and the implementation structure.

o  Re-express the type definition's specifications in terms of the
implementing structure, thus preparing the specifications to drive the
design of type operations. The mapping function precisely defines this
transformation.

o  Design a data type refiner containing the implementation structure and
procedures to implement the type operations. These procedure
descriptions will satisfy the specifications expressed in terms of the
implementing structure.

CSDL  provides  the  specifications,  mapping  functions  and  data  type  models  to 
carry  out  these  steps.  The  process  is  similar  for  constructing  machines: 

o  design external interfaces (public objects),

o  specify externally visible behavior in terms of the visible objects,

o  introduce private objects, including other machines, to realize that
externally visible behavior.
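
The four steps recommended above for an abstract data type can be sketched as follows. The type IntSet, its sorted-list implementation structure and the method names are hypothetical choices made for this example; the sketch is written in Python rather than CSDL and is not a rendering of CSDL's type refiner notation.

```python
import bisect

class IntSet:
    """Step 1, type definition: the model is a mathematical set of integers."""

    def __init__(self):
        # Step 2, implementation structure: a strictly increasing list.
        self._items = []

    def _model(self):
        """Mapping function from the implementation structure to the model."""
        return frozenset(self._items)

    def _invariant(self):
        """Representation invariant: items are sorted with no duplicates."""
        return all(x < y for x, y in zip(self._items, self._items[1:]))

    def insert(self, value):
        before = self._model()
        pos = bisect.bisect_left(self._items, value)
        if pos == len(self._items) or self._items[pos] != value:
            self._items.insert(pos, value)
        # Step 3: the model-level postcondition, model' = model + {value},
        # re-expressed over the implementation through the mapping function.
        assert self._model() == before | {value} and self._invariant()

    def contains(self, value):
        # Step 4: an operation designed against the implementing structure.
        pos = bisect.bisect_left(self._items, value)
        return pos < len(self._items) and self._items[pos] == value
```

Clients of such a type see only the model and the operations; the list and the binary search remain private to the refiner.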


1.3.2  Object  Orientation 

The  constructive  approach  guides  designers  in  obtaining  correct  data  types, 
procedures  and  machines  by  building  down.  CSDL's  second  major  design 
guideline,  object  orientation,  guides  designers  in  building  up  a  system  from 
correct  components.  A  system  is  viewed  as  a  collection  of  data  objects  and 
procedural  objects.  System  construction  is  the  process  of  combining 
previously  designed  data  and  procedural  objects  to  meet  the  system's  goals,  as 
expressed  in  the  system's  specifications.  Object  orientation  cuts  complexity 
because,  during  system  construction,  designers  deal  only  with  an  object's 
external interface, as presented by its abstract model and operations (in the
case  of  an  abstract  data  type),  or  its  public  objects  and  external  behavior 
(in  the  case  of  a  machine).  The  same  specifications  that  drive  an  object's 
constructive  design  explicate  that  object's  properties  and  behavior  for  the 
purposes  of  system  construction. 


1.3.3 Complexity Management

Constructive  correctness  and  object  orientation  combine  for  a  design  style  in 
which  complexity  is  controlled  both  in  the  design  of  individual  system 
components  and  in  the  overall  system  design,  and  correctness  is  maintained 
during  object  design  and  system  design  by  using  specifications  as  goals. 
Complexity  management  occurs  at  three  levels: 

o  within  a  single  component: 

Every  system  component  —  module,  data  type,  and  procedure  —  has  a 
specification;  the  design  which  meets  the  specification  is  created 
separately. 

o  between  a  component  and  its  clients: 

A  component  has  a  public  part:  its  specification  and  public  objects  (in 
the  case  of  a  module)  or  allowable  operations  and  attributes  (in  the  case 
of  an  abstract  data  type).  A  component's  private  part  realizes  its  public 
specification.  Modules  containing  objects  which  are  instances  of  abstract 
data  types  see  only  the  object's  type  specification,  its  allowable 
operations,  and  the  attributes  they  can  examine  or  manipulate.  A  type's 
representing  structure  and  the  procedures  which  implement  its  operations 
are  private.  Modules  interacting  with  a  module  see  its  specification,  the 
public  objects  that  constitute  its  visible  state,  and  the  public  objects 
through  which  they  can  exchange  information  with  it.  The  module's 
procedures  and  internal  data  objects  are  private. 

o  among components:

CSDL allows the design and implementation of data types, procedures and
modules to be carried out independently of designing the system that uses
all these objects.


1.3.4  Linguistic  Support  for  the  Design  Guidelines 

The  table  below  matches  detailed  technical  methods  for  constructively  correct, 
object  oriented  design  with  the  CSDL  features  that  support  each  one. 

    TECHNICAL METHOD                   LANGUAGE SUPPORT

1.  Design mechanisms to match         Data Abstraction Facility
    applications                       Machine "Type" Facility

2.  Use data objects of designer       Data Abstraction
    specified types

3.  Object oriented design             Machine as unit of modularization
                                       Data Abstraction

4.  Refinement                         Subprocedures
                                       Data Type Refiners
                                       Machine Realizations

5.  Constructive Correctness           Mapping Function for Data Types
                                       Weakest Precondition Semantics

        Linguistic Support for Technical Methods

1. CSDL supplies simple primitives for defining datatypes and communication
mechanisms.  These  primitives  do  not  presume  any  particular  mechanisms; 
rather,  they  give  the  designer  the  freedom  to  specify  and  subsequently 
design  mechanisms  that  are  most  appropriate  for  the  problem  at  hand. 

2.  CSDL  encourages  designers  to  augment  built  in  types  with  high  level, 
application  oriented  types,  so  that  an  application  system  can  be  designed 
in  terms  of  the  most  meaningful  objects  for  the  application. 

3.  Object  orientation  conceives  of  a  system  as  a  collection  of  objects,  each 
of  which  performs  some  task,  cooperating  to  achieve  the  system's 
goals. CSDL has two encapsulating devices: machines and data abstraction.

4.  Refinement  -  adding  design  detail  in  a  rational  way  -  is  supported  with 
three  techniques: 

o  subprocedures to refine algorithms,

o  data type refiners to implement data type specifications, and

o  machine realizations to implement machine specifications.

5.  Finally,  a  constructive  methodology  for  creating  designs  that  meet 
specifications  rather  than  testing  and  adjusting  designs  until  they  meet 
specifications  rationalizes  the  creative  process. 


1.4 CONSTRUCTS AND NOTATION

Language  is  the  concrete  tool  we  give  designers.  Any  mechanism,  such  as 
dynamic  process  creation,  or  object,  such  as  the  communication  port,  that 
the  language  cannot  express  will  not  be  in  designs.  Any  design  principle, 
like information hiding, that the language does not allow or makes
difficult  will  not  be  used.  So  our  goals  were  that  CSDL's  notation: 

o  meet the requirement of supporting technical personnel in creating,
reasoning about, and presenting design specifications and
descriptions in terms of the implementing technology,

o  be the vehicle for carrying out a constructive design process that
produces an implementation blueprint,

o  facilitate unambiguous communication and permit verification through
rigorous semantics based on formal models.

We  also  strove  to  meet  general  language  design  goals:  readability, 
writeability,  intuitive  appeal  for  a  user  community  which  is  comfortable 
with  programming  languages,  and  succinctness  without  sacrificing  clarity. 
This  section  presents  a  notation  which  meets  the  three  technical  goals  and 
is  an  honest  but  imperfect  attempt  to  meet  the  human-engineering  goals. 

Section  1.4.1  briefly  explains  the  structure  of  a  system  definition  and 
mentions  the  CSDL  constructs  for  expressing  each  definition  element. 
Section  1.4.2  presents  the  description  language  for  expressing  sequential 
procedures,  and  the  specification  constructs  that  are  applicable  to 
sequential procedures. Section 1.4.3 presents CSDL's built-in data types
and  its  facilities  for  defining  abstract  data  types.  The  section  deals 
with  both  ordinary  passive  data  and  with  the  active  data  types  that 
constitute  communication  objects.  Section  1.4.4  discusses  machines, 
the  building  blocks  for  concurrent  systems,  and  the  language  constructs 
needed  to  specify  interactions  among  autonomous  control  sites.  Section 
1.4.5  presents  CSDL's  documentation  format. 


1.4.1  System  Definition  Structure 

A  system's  definition  is  the  union  of  its  description  (what  components  it 
contains)  and  its  specification  (how  its  components  behave). 
Specifications  (assertions)  and  descriptions  (declarations  and  algorithms) 
are  interspersed  in  each  component's  definition.  A  system  definition  is 
said  to  be  correct  if  there  are  proofs  (in  some  sense)  that  the  system 
description  satisfies  the  specifications. 

CSDL's  notation  comprises  descriptive  constructs  for  stating  declarations 
and algorithms, and specification constructs: atemporal assertions in the
first  order  predicate  calculus  and  temporal  assertions  in  a  variant  of  the 
Moszkowski-Manna  [MOSZ83]  temporal  logic  for  specifying  hardware  behavior. 


1.4.2  Procedures 

Procedures,  functions,  and  type  operations  are  described  in  an  algorithmic 
language  based  on  Dijkstra's  [DIJK76]  guarded  commands  and  specified  by 
atemporal  assertions  that  characterize  relations  among  data. 


1.4.2.1 Algorithmic Language

The algorithmic constructs are:

MEANING                      NOTATION

no operation                 SKIP
sequence                     <statement> ; <statement>
assignment                   <id list> := <expression list>
procedure invocation         <id> (<parameters>)
type operation invocation    <qualified reference> (<parameters>)
non-blocking selection       IF B1 --> S1 [] ... [] Bn --> Sn FI
blocking selection           WHEN B1 --> S1 [] ... [] Bn --> Sn END
non-blocking repetition      DO B1 --> S1 [] ... [] Bn --> Sn OD
blocking repetition          WHENEVER B1 --> S1 [] ... [] Bn --> Sn END

The  formal  semantics  of  these  constructs  (i.e.,  what  happens  when  one  of 
them  is  used)  are  given  by  a  semantic  function  called  the  weakest 
precondition  predicate  transformer.  These  semantics  are  presented  in 
[FRAN83a,  Chapter  6].  Informally: 

No-operation,  sequence  and  assignment  have  the  usual  meaning. 
Procedure  and  type  operation  invocation  suspend  the  caller  and 
transfer  control  to  the  invoked  procedure.  Procedures  may 
instantiate  objects;  upon  completion,  a  procedure's  temporary 
objects  disappear. 

Selection  and  repetition  are  nondeterministic  guarded  commands. 
Non-blocking  selection  and  repetition  have  the  semantics 
presented  in  [DIJK76]. 

Blocking  selection  and  repetition  can  test  the  same  conditions 
as  the  non-blocking  varieties  and  also  test  whether 
communication  events  (data  arrival  or  departure)  have  occurred. 

Their  guards  may  refer  to  active  or  passive  objects;  at  least
one  guard  should  refer  to  an  active  object.

WHEN  blocks  until  some  guard(s)  is  (are)  true,  then  executes  the
statement  associated  with  a  true  guard.  Since  it  is  not
required  that  any  of  the  guards  eventually  becomes  true,  the
statement  may  wait  forever.

WHENEVER  blocks  until  some  guard(s)  is  (are)  true,  executes  the 
statement  associated  with  a  true  guard,  returns  to  waiting  until 
some  guard(s)  is  (are)  again  true,  and  so  forth.  Once  it  is 
entered,  there  is  no  exit  from  this  statement. 
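
The informal semantics above can be illustrated with a small Python sketch. It is not part of CSDL: the function names (guarded_if, guarded_do, when) and the wait_for_event callback are hypothetical, and random.choice merely stands in for the nondeterministic choice among true guards. The sketch assumes the [DIJK76] rule that IF aborts when no guard is true, while DO terminates.

```python
import random


def enabled_statements(alternatives):
    """Return the statements whose guards currently evaluate to true."""
    return [statement for guard, statement in alternatives if guard()]


def guarded_if(alternatives):
    """Non-blocking selection (IF ... FI): run one statement with a true guard."""
    enabled = enabled_statements(alternatives)
    if not enabled:
        # Assumption taken from [DIJK76]: IF with no true guard aborts.
        raise RuntimeError("IF ... FI: no guard is true")
    return random.choice(enabled)()


def guarded_do(alternatives):
    """Non-blocking repetition (DO ... OD): repeat until no guard is true."""
    while True:
        enabled = enabled_statements(alternatives)
        if not enabled:
            return
        random.choice(enabled)()


def when(alternatives, wait_for_event):
    """Blocking selection (WHEN ... END): wait until some guard is true.

    wait_for_event is a hypothetical hook standing in for the arrival or
    departure of data on an active object; if no guard ever becomes true
    this loop never returns, as the text notes.
    """
    while True:
        enabled = enabled_statements(alternatives)
        if enabled:
            return random.choice(enabled)()
        wait_for_event()


def gcd(a, b):
    """Euclid's algorithm for positive integers, written as a DO ... OD loop."""
    state = {"x": a, "y": b}

    def reduce_x():
        state["x"] -= state["y"]

    def reduce_y():
        state["y"] -= state["x"]

    guarded_do([
        (lambda: state["x"] > state["y"], reduce_x),
        (lambda: state["y"] > state["x"], reduce_y),
    ])
    return state["x"]
```

WHENEVER would correspond to calling when inside an endless loop, which is why it has no exit.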

A  procedure  description  consists  of  some  optional  declarations  of  local
objects  and  an  algorithm  described  using  these  constructs.


1.4.2.2  Atemporal  Specification

The  context  for  atemporal  specifications  is  an  external  view  of  procedures 
as  functional  relations  between  inputs  and  outputs.  In  that  context,  it 
is  useful  to  specify  facts  about  state,  such  as  preconditions,  and  facts 
about  behavior  such  as  the  precondition/postcondition  pair,  which  express 
a  procedure's  effect.  These  facts  are  specified  through  values  of  data 
objects  and  changes  in  those  values. 

CSDL's  atemporal  language  is  first  order  predicate  calculus  with 
extensions  such  as  a  facility  for  introducing  local  definitions  and 
convenience  features  like  a  case  construct  for  abbreviating  a  conjunction 
of  implications.  The  language  is  described  in  detail  in  [FRAN83b, 
Chapters  3-8].  We  divide  atemporal  assertions  into  "value  propositions" 
and  "transition  propositions".  Value  propositions  characterize  state  by 
asserting  static  relations  among  values  of  several  objects  (x  >  y)  or 
between  an  object  and  its  values  (x  <  10  OR  x  >  10).  Transition 
propositions  characterize  state  transitions  by  asserting  a  relation 
between  a  state  and  its  (not  necessarily  immediate)  successor  (X' = X + 1).

CSDL  uses  two  sorts  of  procedures:  machine  procedures  and  the  type
operations  of  abstract  data  types.  Each  sort's  specification  may  contain
the  following  elements:

Name  (  <input  parameters>  )  RETURNS  <type  specification>

PRE  <value  proposition>

POST  <transition  proposition>

INVARIANT  <value  proposition>

BEHAVIOR  <assertion>

END

The  optional  RETURNS  clause,  which  is  part  of  the  procedure  description, 
is  used  for  value  returning  procedures. 

In  machine  procedure  specifications,  the  proposition  following  PRE  is  a
precondition  which  specifies  the  permissible  machine  states  when  invoking
this  procedure.  In  a  type  operation  specification,  PRE  expresses
constraints  on  the  parameters  and  the  instance;  correct  operation  is
guaranteed  if  the  precondition  is  satisfied.  The  precondition  can  define
a  required  relation  among  the  procedure's  local  objects,  between  the
procedure's  parameters  and  its  local  objects,  or  among  local  objects,
parameters  and  global  objects.  For  a  type  operation,  these  objects  may  be
elements  of  the  type's  conceptual  model  rather  than  actual  objects  in  its
representing  structure.

The  proposition  following  POST  is  a  postcondition  which  specifies  the 
relationship  between  the  state  of  the  machine  or  type  instance  at 
termination  and  the  parameters  and  state  of  the  machine  or  instance  at 
invocation.  If  there  is  a  return  value  then  the  relationship  between  it 
and  the  parameters  and  state  of  the  machine  or  instance  at  invocation  is 
also  specified. 

The  optional  INVARIANT  specifies  a  relationship  among  the  parameters  and 
global  data  of  the  procedure  or  type  operation  that  is  preserved  by  an 
execution.  It  must  be  guaranteed  that,  if  the  invariant  is  satisfied  when 
the  procedure  or  type  operation  is  entered,  then  it  will  be  satisfied  upon 
exit. 

The  optional  BEHAVIOR  section  allows  the  designer  to  express  any  useful 
information  about  the  procedure's  function  or  properties  that  is  neither  a 
precondition,  a  postcondition  nor  an  invariant.  An  atemporal  assertion 
can,  for  example,  express  a  resource  constraint.  A  procedure's 
performance  specifications  expressed  as  temporal  assertions  (see  Section 
1.4.3.4)  would  also  appear  in  its  BEHAVIOR  section.

In  summary,  a  type  operation  or  procedure's  specification  is  a  collection 
of  atemporal  assertions  which,  minimally,  define  a  relationship  between 
input  and  output  states  together  with  the  constraints  on  the  input.  The 
intended  interpretation  is  that  when  a  procedure  is  invoked  with  the 
objects  and  parameters  satisfying  its  input  constraint,  it  is  guaranteed 
to  terminate  with  the  objects  in  a  state  correctly  related  to  the  input 
state. 
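
As a rough illustration of how PRE, POST and INVARIANT relate, the following Python sketch checks them at run time around a procedure. This is only an analogy: CSDL specifications are assertions to be proved, not executed, and the names specified, withdraw and the dictionary-based machine state are hypothetical. The POST predicate receives both the state at invocation and the state at termination, mirroring a transition proposition.

```python
import copy
import functools


def _require(condition, message):
    """Raise if a specification clause does not hold."""
    if not condition:
        raise AssertionError(message)


def specified(pre, post, invariant=lambda state: True):
    """Wrap a procedure with run-time checks of PRE, POST and INVARIANT."""
    def wrap(procedure):
        @functools.wraps(procedure)
        def checked(state, *args):
            held_on_entry = invariant(state)
            _require(pre(state, *args), "PRE violated by the caller")
            old_state = copy.deepcopy(state)      # state at invocation
            result = procedure(state, *args)
            # POST relates the final state and result to the initial state.
            _require(post(old_state, state, result, *args), "POST violated")
            # INVARIANT: if it held on entry it must hold on exit.
            if held_on_entry:
                _require(invariant(state), "INVARIANT not preserved")
            return result
        return checked
    return wrap


@specified(
    pre=lambda s, amount: 0 < amount <= s["balance"],
    post=lambda old, new, result, amount: (
        new["balance"] == old["balance"] - amount
        and result == new["balance"]
    ),
    invariant=lambda s: s["balance"] >= 0,
)
def withdraw(state, amount):
    """Hypothetical value-returning procedure on a machine state."""
    state["balance"] -= amount
    return state["balance"]
```

Calling withdraw with an amount larger than the balance fails the PRE check, which corresponds to the caller not meeting the input constraint; the specification then guarantees nothing about the outcome.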


1.4.3  Data 

Data  objects  hold  all  the  information  a  system  uses.  Machine  data  are 
permanent;  they  last  as  long  as  their  containing  machine  does,  though
their  values  may  change  over  time  (for  example,  a  data  base).  Procedure 
data  are  transient;  they  come  into  existence  when  their  procedure  is 
instantiated  and  vanish  when  their  procedure  terminates. 

Data  objects  are  also  either  active  or  passive.  A  passive  object 
undergoes  a  value  change  only  when  a  procedure  in  its  containing  machine 
manipulates  it.  An  active  object  may  undergo  a  value  change  without  its 
containing  machine's  operating  on  it.  Intermachine  communication  occurs 
between  active  objects.

CSDL  provides  some  built-in  passive  and  active  types,  and  a  type
definition  facility.


1.4.3.1  Passive  Data


An  instance  of  a  passive  data  type  changes  value  only  as  the  result  of  the 
invocation  of  some  procedure  or  operation  within  the  machine  containing 
it.  The  assumption  of  passivity  of  objects  lies  at  the  heart  of  the  basic 
theories  for  reasoning  about  a  program  by  looking  at  its  effects  on  data. 
In  particular,  the  power  of  invariants  is  due  in  part  to  the  assumption  of 
passivity  of  the  objects  involved. 


1.4.3.1.1  Built-in  Passive  Types

CSDL  has  four  built-in  scalar  types:  Boolean,  Char,  Integer  and  Real. 
Type  Boolean  has  the  usual  value  set  (TRUE  and  FALSE)  and  operations  (NOT, 
AND,  OR,  COR,  XOR,  CAND).  Type  Char  has  two  operations,  "assignment"  and
"equality  test",  and  no  pre-defined  value  set.  Designers  can  define  its 
value  set  to  suit  the  intended  implementation  environment.  Type  Real 
denotes  the  mathematical  reals.  Type  Integer  denotes  the  integral  reals, 
so  every  Integer  data  object  is  also  a  Real.  CSDL  provides  the  usual 
numeric,  relational,  and  boolean  operations;  numeric,  relational  and 
boolean  expressions  are  formed  in  the  usual  way.  Initial  value 
declarations  are  allowed  for  all  scalar  types.  Using  the  abstract  data 
type  facility,  fixed  range  Integer  and  Real  subtypes  can  be  defined. 

CSDL  provides  four  constructed  types:  enumerated  types,  records, 
discriminated  unions  and  arrays. 

An  enumerated  type  is  a  finite  set  of  elements;  each  element's  only 
property  is  its  name.  There  are  both  unordered  and  ordered  enumerated 
types.  All  of  the  relational  operators  (<,  >,  <=,  >=,  =)  are  defined  on
elements  of  ordered  enumerated  types  but  only  the  relational  operation 
equality  (and,  therefore,  inequality)  is  defined  on  elements  of  unordered 
enumerated  types.  Assignment  is  defined  on  all  enumerated  types. 

A  record  data  object  consists  of  a  fixed  finite  number  of  data  objects, 
called  fields,  which  may  be  of  different  types.  CSDL  records  are  similar 
to  records  or  structures  in  a  number  of  Algol-like  programming  languages. 
Initial  value  declarations  are  allowed  for  entire  records  and  for  record 
fields. 

Discriminated  unions  provide  a  facility  for  working  with  data  objects  that
may  contain  values  whose  type  is  one  of  a  finite  set  of  types.  They  are
similar  to  variant  records  in  Pascal.  A  discriminated  union's  tag  field
is  set  automatically  whenever  its  value  field  is  changed.  The  tag  field
cannot  be  manipulated  by  any  other  means.  The  value  field  may  be  of  any
type.
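
The automatic tag can be pictured with a short Python sketch. The class name DiscriminatedUnion and its interface are hypothetical; the point is only that the tag is derived from the value on every assignment and offers no other way to be changed.

```python
class DiscriminatedUnion:
    """A value whose type is one of a finite set, with a read-only tag."""

    def __init__(self, *alternatives):
        self._alternatives = alternatives   # the finite set of permitted types
        self._tag = None                    # no value has been assigned yet
        self._value = None

    @property
    def tag(self):
        # Read-only: there is deliberately no setter for the tag.
        return self._tag

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        for alternative in self._alternatives:
            if type(new_value) is alternative:
                self._tag = alternative     # tag follows the value automatically
                self._value = new_value
                return
        raise TypeError("value is not of a type in this union")
```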

CSDL  provides  Dijkstra  arrays  as  the  sequence  abstraction.  An  array  data
object  consists  of  a  set  of  data  objects,  all  of  the  same  type,  that  is
indexed  by  a  contiguous  range  of  integers;  the  set  of  objects  may  be
empty.  An  array's  size  varies,  shrinking  when  the  object  with  the  largest
or  smallest  index  is  deleted  and  growing  when  an  object  with  an  index  that
is  one  more  than  the  highest  index  or  one  less  than  the  smallest  index  is
added.

CSDL  provides  the  array  attributes  and  operations  proposed  in  [DIJK76]. 
The  special  array  attributes  are: 

(hib | lob)  -  an  integer  identifying  the  (largest | smallest)  index  of  the
array.

dom  -  an  integer  identifying  the  size  of  the  array.

The  special  array  operations  are:

(high | low)_extend  -  a  function  which  adds  a  new  value  to  the  (top | bottom)
of  the  array,  increases  dom  by  one,  and  (increases
hib | decreases  lob)  by  one,

(high | low)_remove  -  a  function  which  removes  a  value  from  the  (top | bottom)
of  the  array,  decreases  dom  by  one,  and  (decreases
hib | increases  lob)  by  one,

assignment  -  of  a  value  to  an  arbitrary  array  element,  or  of  values
to  an  entire  array  with  an  array  constructor,

access  to  arbitrary  array  elements  -  in  the  usual  programming  language
manner.

Dijkstra  arrays  are  more  general  than  the  usual  programming  language 
arrays,  so  they  allow  designers  to  describe  more  general  information 
structures  such  as  files,  databases  or  dynamic  memory. 
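
A minimal Python sketch of these attributes and operations follows. The class name DijkstraArray and the choice of a deque as the backing store are assumptions made for illustration; the sketch also assumes the usual convention that an empty array has hib = lob - 1.

```python
from collections import deque


class DijkstraArray:
    """A sequence indexed by a contiguous integer range that can grow or
    shrink at either end."""

    def __init__(self, lob=0, values=()):
        self._items = deque(values)
        self.lob = lob                      # smallest index

    @property
    def dom(self):
        """Number of elements currently in the array."""
        return len(self._items)

    @property
    def hib(self):
        """Largest index; equals lob - 1 when the array is empty."""
        return self.lob + self.dom - 1

    def high_extend(self, value):
        self._items.append(value)           # dom and hib both grow by one

    def low_extend(self, value):
        self._items.appendleft(value)       # dom grows by one ...
        self.lob -= 1                       # ... and lob decreases by one

    def high_remove(self):
        return self._items.pop()            # dom and hib both shrink by one

    def low_remove(self):
        value = self._items.popleft()       # dom shrinks by one ...
        self.lob += 1                       # ... and lob increases by one
        return value

    def _offset(self, index):
        if not self.lob <= index <= self.hib:
            raise IndexError("index outside lob..hib")
        return index - self.lob

    def __getitem__(self, index):
        return self._items[self._offset(index)]

    def __setitem__(self, index, value):
        self._items[self._offset(index)] = value
```

Used only with high_extend and high_remove the object behaves as a stack; with high_extend and low_remove it behaves as a queue whose indices keep advancing, which is one way to picture a file or log.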


1.4.3. 1.2  Abstract  Data  Types 

When  a  language  allows  designers  to  augment  its  built  in  types  with  high 
level,  application  oriented  types,  designers  can  work  in  terms  of  the  most 
meaningful  objects  for  the  application.  For  example,  in  a  compiler 
design,  it  is  more  meaningful  to  manipulate  a  symbol  table  object  than  to 
manipulate  the  more  primitive  objects  that  provide  the  symbol  table. 
Furthermore,  the  benefits  in  complexity  management  of  separating  the  use 
of  a  high  level  type  from  its  definition  are  well  documented  in  [LISK75, 
LISK79a,  WULF76].  CSDL's  abstract  data  type  facility  is  a  major  design 
rationalization  and  complexity  management  feature.

A  user-defined  abstract  data  type  is  a  set  of  values  and  a  set  of
permissible  operations  on  those  values.  User  defined  abstract  types 
always  have  a  definition  which  documents  their  externally  visible  behavior 
and  externally  accessible  operations.  If  the  designer  decides  there  are 
no  design  issues  involved  in  building  instances  of  the  type,  no  further 
design  is  required.  If  the  designer  decides  there  are  design  issues,  then 
further  design  work  is  called  for.  The  design  of  the  type's  representing 
data  structure  and  operation  implementations  are  packaged  in  a  unit  called 
a  refiner,  which  contains  the  internal  details  that  implement  the  external 
view  and  operations. 


1.4.3.1.2.1  Type  Definitions

Abstract  types  are  defined  using  abstract  model  definitions,  which  are 
considered  more  understandable  and  easier  for  designers  to  construct  than 
axiomatic  definitions  [LISK79b].  An  abstract  type  definition  consists  of 
a  model  of  the  value  set,(1)  specifications  of  the  allowable  operations  on 
the  value  set,  and  optional  INITial  and  INVARIANT  specifications. 

The  MODEL  presented  in  every  type  definition  is  a  device  with  which  to 
express  the  specifications  of  the  type's  behavior.  This  conceptual  model 
has  nothing  to  do  with  how  the  type  is  eventually  implemented.  Its 
purpose  is  to  give  users  of  the  type  a  picture  of  how  the  type  behaves  and 
what  type  operations  accomplish.  However,  there  is  nothing  to  prevent  a 
type's  representing  structure  from  being  the  same  as  its  conceptual  model. 

CSDL  has  two  kinds  of  type  operations:  ofuns  which  change  the  object's 
state  and  may  return  a  value,  and  vfuns  which  return  a  value  but  do  not 
change  the  object's  state  [ROBI77].  If  a  design  undergoes  formal
verification,  either  in  conjunction  with  its  construction  or  after  it  is 
complete,  only  vfuns  and  not  value  returning  ofuns  may  be  used  in 
expressions  in  the  guards  of  IF,  DO,  WHEN  and  WHENEVER  statements,  because 
the  semantics  of  these  statements  require  that  guard  expression  evaluation 
be  side-effect  free. 

Type  operation  specifications  were  explained  in  Section  4.1.2.  A  type 
definition  contains  only  the  type  operation  specifications,  which  are 
presented  in  terms  of  the  type's  conceptual  model.  The  type's  refiner 
contains  operation  descriptions. 

(1)  A  model  is  a  collection  of  typed  objects,  for  example,

STACK  (T:TYPE)  IS

MODEL  x:  T  ARRAY,  tos: INTEGER

The  MODEL  objects'  types  indicate  value  sets  for  those  objects;  the
operations  defined  on  those  types  are  not  exported  to  the  type  under
specification  as  permissible  operations  on  that  type.


INITial  states  can  be  specified  for  an  abstract  data  type's  conceptual
object  space.  An  INIT  assertion  specifies  a  desired  relationship  among
the  values  of  these  objects.  The  intended  interpretation  is  that,  when  a
type  instance  comes  into  existence,  its  objects  are  guaranteed  to  satisfy
the  assertion.

An  INVARIANT  is  a  property  of  a  type's  MODEL  established  when  an  instance 
of  the  type  is  created.  Each  type  operation  must  have  the  property  that, 
if  an  instance  satisfies  the  invariant  and  the  operation  is  invoked  so
that  its  input  constraint  is  satisfied,  the  instance  will  satisfy  the 
invariant  when  the  operation  terminates.  However,  abstract  type  operation 
implementations  may  violate  a  type's  invariant  while  they  are  in  progress 
because  the  invariant  is  a  guarantee  to  the  type's  users  in  their  scope, 
not  inside  the  type's  scope. 
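
The pieces of a type definition can be sketched in Python using the STACK model from the footnote (x: T ARRAY, tos: INTEGER). Everything beyond those two model objects is an assumption made for illustration: the INIT assertion (tos = 0 and x empty), the INVARIANT (tos equals the number of elements of x), the operation names, and the SpecViolation exception. Note that the invariant is checked only at operation boundaries; inside push it is briefly false.

```python
class SpecViolation(Exception):
    """Raised when an INIT, PRE or INVARIANT assertion fails."""


class Stack:
    """Sketch of an abstract type with MODEL x (a list) and tos (an integer)."""

    def __init__(self):
        self.x = []
        self.tos = 0
        # INIT (assumed): a new instance is empty.
        self._check(self.tos == 0 and self.x == [], "INIT")
        self._check(self._invariant(), "INVARIANT")

    @staticmethod
    def _check(condition, label):
        if not condition:
            raise SpecViolation(label)

    def _invariant(self):
        # INVARIANT (assumed): tos counts the elements held in x.
        return self.tos >= 0 and self.tos == len(self.x)

    # ofuns: change the state and may return a value.
    def push(self, value):
        self._check(self._invariant(), "INVARIANT on entry")
        self.x.append(value)        # invariant is temporarily violated here
        self.tos += 1               # ... and re-established here
        self._check(self._invariant(), "INVARIANT on exit")

    def pop(self):
        self._check(self.tos > 0, "PRE: stack must not be empty")
        self.tos -= 1
        value = self.x.pop()
        self._check(self._invariant(), "INVARIANT on exit")
        return value

    # vfuns: return a value without changing the state.
    def top(self):
        self._check(self.tos > 0, "PRE: stack must not be empty")
        return self.x[self.tos - 1]

    def is_empty(self):
        return self.tos == 0
```

A guard such as "not s.is_empty()" uses a vfun and is therefore side-effect free; using the value-returning ofun pop in a guard would change the stack merely by evaluating the guard.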

CSDL  also  provides  generic  abstract  types.  A  type  definition  may  contain 
an  optional  parameter  list  consisting  of  pairs  of  the  form  <id  list>:<type 
specification>  or  <id  list>:TYPE.  These  parameters  can  be  instantiated
when  an  instance  of  the  type  is  declared;  they  particularize  a  generic 
abstract  type  to  an  abstract  type.  The  values  of  these  parameters  remain 
constant  for  the  type  instance's  lifetime.  <id  list>:<type  specification> 
specifies  formal  values.  The  <id>s  in  <id  list>  may  appear  anywhere  that 
a  value  may  appear,  for  example,  in  an  assertion.  <type  specification>
denotes  a  standard  or  user  defined  type.  <id  list>:TYPE  specifies  a  list 
of  formal  names  of  known  <type  specification>s.  Formal  TYPE  parameters 
may  occur  anywhere  in  the  type  definition  that  a  type  designator  is 
required,  for  example,  in  the  conceptual  object  space  declarations  and  in 
operations'  parameter  lists  or  return  clauses. 


1.4.3.1.2.2  Type  Refiners

A  refiner  is  the  package  that  contains  the  concrete  decisions  about  how  to 
represent  an  instance  of  an  abstract  data  type  and  implement  its 
operations.  A  refiner  must  contain: 

o  the  data  structure  chosen  to  represent  an  instance  of  the  type,

o  one  procedure  for  each  operation  defined  on  the  type,

o  a  mapping  function  that  defines  how  the  data  structure  corresponds
to  the  model  declared  in  the  type  definition.

A  refiner  is  not  a  machine;  it  does  not  constitute  a  (potentially) 
asynchronous,  independent  locus  of  control.  Rather,  a  refiner  can  be
thought  of  as  a  set  of  templates  of  data  structures  and  procedures. 
Instances  of  these  templates  replace  uses  of  objects  of  the  type  being 
refined.  Each  declared  object  of  the  type  may  be  thought  of  as  being 
replaced  by  a  distinct  copy  of  the  representing  data  structure,  and  each 
operation  invocation  may  be  thought  of  as  being  macro-replaced  by  an 
invocation  of  the  refiner-procedure  of  the  same  name. 

The  operations  specified  in  a  type  definition  are  implemented  by
procedures  with  the  same  names  as  these  operations.  This  establishes  the
relationship  between  a  definition's  operations  and  the  refiner's
implementing  procedures.

The  relation  between  a  type  definition's  conceptual  object  space  and  the 
refiner's  representing  structure  is  defined  by  a  mapping  function  from  the 
objects  of  the  lower-level  type  refiner  onto  the  conceptual  object  space 
of  the  data  type  to  be  represented.  The  mapping  function  is  defined 
within  the  refiner  because  that  is  the  only  place  where  the  lower  level 
objects  are  visible. 

The  mapping  function  is  used  to  uniformly  substitute  lower-level  objects 
for  upper-level  objects  in  the  type's  specifications.  The  resulting 
assertions  constitute  specifications  of  initial  states,  data  invariants, 
and  procedure  specifications  which  must  be  satisfied  by  the  representing 
data  structure  and  the  procedures  which  implement  the  type  operations  in 
order  for  the  refinement  to  be  a  consistent  representation  of  the  data 
abstraction. 
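
The role of the mapping function can be shown with a small, hypothetical Python example. Assume a stack type whose MODEL is the pair (x, tos) and a refiner that represents an instance as a singly linked list of (value, rest) pairs, with None for the empty stack. The names mapping, push and push_post are illustrative only. The POST of push is written against the model; the mapping function lets it be checked against the representation.

```python
def mapping(rep):
    """Mapping function: linked-list representation -> (x, tos) model."""
    items = []
    node = rep
    while node is not None:
        value, node = node
        items.append(value)
    items.reverse()                 # the head of the list is the top of stack
    return items, len(items)


def push(rep, value):
    """Refiner procedure with the same name as the type operation."""
    return (value, rep)


def push_post(old_model, new_model, value):
    """POST of push, stated purely in terms of the conceptual model."""
    old_x, old_tos = old_model
    new_x, new_tos = new_model
    return new_x == old_x + [value] and new_tos == old_tos + 1


def refinement_holds_for_push(rep, value):
    """Substitute mapped lower-level objects into the upper-level POST."""
    return push_post(mapping(rep), mapping(push(rep, value)), value)
```

Checking refinement_holds_for_push on sample representations is only a test, not a proof; in CSDL the substituted assertions are obligations to be verified for all representations.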


1.4.3.2  Active  Data

The  idea  and  power  of  passivity  of  objects  fits  well  with  a  single  machine 
"running"  in  isolation.  When  a  machine  is  put  in  association  with  other 
machines,  and  interacts  with  them,  things  become  more  complicated.  One 
machine,  A,  can  affect  or  interact  with  another  machine,  B,  only  by  somehow
changing  the  value  of  one  of  B's  objects.  From  the  point  of  view  of 
machine  B,  some  of  its  objects  have  "spontaneously"  changed  state;  thus 
they  are  not  passive.  An  active  object  is  one  that  can  exhibit  a  state 
change  that  is  not  the  result  of  an  operation  or  procedure  invoked  upon  it 
within  the  machine  that  contains  it.  When  machine  A  causes  a  spontaneous 
state  change  in  B,  there  is  a  flow  of  information  from  A  to  B.  The 
mechanism  by  which  one  CSDL  machine  can  affect  another  involves  a  pair  of 
complementary  active  data  objects  and  their  connection. 

Every  active  type  is  an  abstract  data  type  whose  model  is  composed  of 
passive  objects.  It  may  be  scalar  or  structured.  Communication  paths 
among  CSDL  machines  are  formed  by  connecting  instances  of  complementary 
active  types.  Often  the  models  for  each  of  a  pair  of  complementary  types 
are  identical,  but  each  has  a  different  set  of  operations.  Some 
operations  (for  example,  receive)  absorb  information  into  a  machine's 
space;  others,  (for  example,  send)  emit  it  into  a  machine's  environment. 
The  role  of  an  active  type  (emitter  or  absorber)  is  determined  by  the 
type's  operations,  not  its  model.  Structured  types  may  even  play  the  role 
of  emitter  and  absorber.  For  example,  each  end  of  a  full  duplex  channel 
could  be  specified  as  the  same  active  type  whose  model  consists  of  two 
data  fields,  with  operations  to  emit  through  one  field  and  absorb  through 
the  other.  The  channel  is  constructed  by  connecting  one  object's
"emitter"  to  the  other's  "absorber,"  and  vice  versa.

The  fact  that  every  active  type  is  an  abstract  data  type  allows  the
definition  of  various  forms  of  blocking  and  non-blocking  send  and  receive
primitives,  and  ack/nack  and  time-out  protocols.


1.4.3.2.1  Complementarity

The  idea  of  complementary  types  is  based  on  an  intuitive  "plug-and-socket" 
idea.  Each  member  of  a  set  of  machines  has  a  piece  of  communication 
equipment,  an  object  of  an  active  type.  When  the  pieces  are  plugged  into
one  another  they  form  a  plex  over  which  the  machines  may  interact  with  one 
another.  The  various  pieces  may  be  structured,  so  there  may  be  several 
ways  in  which  the  components  of  a  piece  could  be  mated  with  the  components 
of  other  pieces.  Thus  there  must  be  a  specification  of  the  one  desired 
way  of  mating  the  components. 

We  formalize  these  ideas  in  CSDL  using  the  notion  of  active  type  MODEL 
compatibility.  Two  objects  are  compatible  if  they  are  of  the  same  type. 
Two  sets  of  objects  are  compatible  if  they  can  be  put  into  a  one-to-one 
correspondence  so  that  the  corresponding  pairs  of  objects  are  compatible. 
Two  sets  are  said  to  be  complemented  when  a  one-to-one  correspondence 
between  them  has  been  specified.  In  the  simplest  case,  complementary 
types  may  be  built  up  by: 

o  defining  an  active  type's  MODEL  as  a  set  of  passive  objects, 

o  defining  its  complement's  MODEL  as  a  compatible  set  of  objects,  and

o  complementing  the  two  sets,  that  is,  specifying  the  desired  one-to-one 
correspondence. 


More  generally,  a  non-empty  subset  of  one  active  type's  MODEL  objects  is 
made  compatible  with  a  non-empty  subset  of  another's  MODEL  objects,  and 
these  two  subsets  are  complemented.  This  allows  an  active  type's  MODEL  to 
contain  objects  that  are  available  for  specification  purposes  but  do  not 
participate  in  connections  with  other  active  objects. 

Complementary  types  should  be  designed  in  tandem;  the  design  process  will 
produce  a  pair  whose  coupled  effect  is  the  communications  protocol  the 
designer  is  aiming  for.  However,  each  of  the  complementary  types  will  be 
documented  separately;  each  type's  specification  will  contain  a 
COMPLEMENTS  specification  that  expresses  the  complementary  relationship 
between  the  elements  of  that  type's  MODEL  and  the  elements  of  its 
complement's  MODEL. 

For  example,  an  active  type  T  could  be  modeled  as  having  two  components,  x 
and  y,  of  type  A,  and  one  component,  z,  of  type  B.  Another  type  U  could 
be  modeled  as  having  two  components,  p  and  r,  of  type  A  and  one  component, 
s,  of  type  B.  T  and  U  would  have  appropriate,  different,  operations. 
Types  T  and  U  are  compatible  because  there  exist  one-to-one
correspondences  between  them  in  which  corresponding  pairs  are  compatible.
{x<-->p,  y<-->r,  z<-->s}  and  {x<-->r,  y<-->p,  z<-->s}  are  two  such
correspondences.  T  and  U  are  complements  once  one  of  these
correspondences  is  specified.
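
The compatibility check in this example can be mechanized. The Python sketch below enumerates every one-to-one correspondence between two MODELs in which paired components have the same type. Representing a MODEL as a dictionary from component name to type name, and the function name correspondences, are assumptions made for illustration.

```python
from itertools import permutations


def correspondences(model_t, model_u):
    """Yield each one-to-one pairing of components with matching types."""
    if len(model_t) != len(model_u):
        return                              # no one-to-one pairing can exist
    names_t = list(model_t)
    for candidate in permutations(model_u):
        pairs = list(zip(names_t, candidate))
        if all(model_t[a] == model_u[b] for a, b in pairs):
            yield dict(pairs)


T = {"x": "A", "y": "A", "z": "B"}          # MODEL of type T
U = {"p": "A", "r": "A", "s": "B"}          # MODEL of type U
```

For T and U this yields exactly the two correspondences named in the text. The sets are compatible because at least one correspondence exists; they become complemented only when the designer picks one of them in a COMPLEMENTS specification.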


1.4.3.2.2  Connection

An  isolated  member  of  a  complementary  set  of  active  objects  is  useless;  in 
fact,  members  behave  passively  in  isolation.  However,  once  they  are 
connected  the  resulting  plex  can  exhibit  active  behavior;  information  can 
flow  from  object  to  object,  and  hence  from  machine  to  machine.  The 
COMPLEMENTS  specifications  of  the  complementary  types  provide  the 
semantics  of  connection.  Those  semantics  are  that  each  component  of  one 
object  is  associated  with  its  complement  in  the  connected  object.  The 
nature  of  the  association  is  that  the  components  have  the  same  value 
throughout  the  period  of  the  connection. 

It  is  now  clear  how  active  behavior  is  obtained  in  a  connected  set  of 
complementary  active  objects  —  the  invocation  of  an  operation  that 
changes  a  component  of  one  object  immediately  changes  the  state  of  the 
components  paired  to  it  in  other  objects;  the  machines  containing  these 
other  objects  see  spontaneous  state  change. 

For  the  example  in  4.2.2.1,  an  object  of  type  T  could  be  connected  to  an
object  of  type  U  in  either  of  the  configurations  shown  below. 


type T           type U              type T           type U

x:A  -------->  p:A                  x:A  -------->  r:A
y:A  -------->  r:A                  y:A  -------->  p:A
z:B  <--------  s:B                  z:B  <--------  s:B

Complementarity  of  Types  T  and  U


1.4.3.2.3  Inlets  and  Outlets

CSDL  has  two  predefined  active  types.  One,  the  outlet,  allows  a  machine 
to  send  information  to  its  environment;  the  other,  the  inlet,  allows  a 
machine  to  receive  information  from  its  environment  [BOEB78].  These  types 
are  structured;  their  conceptual  model  is  a  record  with  two  fields:  a 
window  that  holds  information  of  some  type,  and  a  Boolean  flag.  Inlet  and 
outlet  flags  make  transmission  and  communication  detectable.  Without
flags,  detection  by  comparing  old  and  new  window  values  fails  when
identical  values  are  transmitted  on  the  i-th  and  (i+1)-th  transmissions.

Each  type  has  two  allowable  operations  which  may  be  performed  only  by
procedures  in  a  machine  containing  the  object  of  the  type.  The  inlet  has
a  "get"  for  reading  information  and  a  "came"  for  testing  its  flag.  The
outlet  has  a  "put"  for  writing  information  into  it,  and  a  "went"  for
testing  its  flag.

An  inlet  is  an  object  from  which  information  may  be  extracted  by  a  get 
operation.  A  datum  arriving  from  outside  the  inlet's  containing  machine 
will  set  its  window  to  the  arriving  value  and  its  flag  so  that 
inlet.came = TRUE.  The  first  get  performed  after  the  arrival  of  a  datum
results  in  a  new  value  being  obtained;  all  gets  after  the  first  and  before 
the  next  data  arrival  return  the  same  value  as  the  first  get.  This  first 
get  also  resets  the  flag.  Get,  came,  and  data  arrival  are  indivisible 
actions;  a  flag  change  cannot  overlap  the  invocation  of  an  operation. 
This  rule  constitutes  the  only  guaranteed  synchrony  constraint. 

An  outlet  is  an  object  into  which  information  may  be  deposited  by  an 
invocation  of  a  put  operation.  Some  time  after  invoking  "put,"  an
outlet's  flag  will  spontaneously  change  so  that  outlet.went = TRUE  if
inlet.get  is  applied  to  its  connected  inlet.  Only  the  last  value  put
before  a  change  to  true  will  be  communicated.  Put,  went  and  the  flag 
change  are  indivisible,  so  the  flag  change  cannot  overlap  the  invocation 
of  an  operation;  this  constitutes  the  only  guaranteed  synchrony 
constraint. 

The  communication  model  based  on  connected  inlets  and  outlets 
distinguishes  between  transmission  and  communication.  Transmission 
between  an  outlet  and  a  matching  inlet  is  instantaneous;  transmission  is
putting  a  value  in  an  outlet.  Putting  sets  the  outlet's  flag  and  the
matching  inlet's  flag,  and  puts  the  value  into  the  outlet's  and  matching
inlet's  windows.  Communication  happens  when  a  value  is  got  by  a  receiving
machine.  Getting  absorbs  a  value  into  a  machine's  local  space,  and  resets 
the  flag  on  both  the  inlet  from  which  the  value  is  got  and  the  matching 
outlet.  Getting  does  not  change  the  value  in  either  object's  window. 
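
The behavior of a connected outlet and inlet can be approximated in Python as follows. This is one reading of the text, not a definitive model: the shared Connection object stands for the connection semantics (paired components hold the same value), came is taken to be true while an arrived value has not yet been got, and went is taken to be true once the last value put has been got. In particular, went reporting true before any put is an artifact of this sketch. The class names are hypothetical, and the indivisibility of operations is not modeled because the sketch is single-threaded.

```python
class Connection:
    """State shared by a connected outlet and inlet."""

    def __init__(self):
        self.window = None      # the value most recently put
        self.flag = False       # True from a put until the next get


class Outlet:
    def __init__(self, connection):
        self._connection = connection

    def put(self, value):
        # Transmission: window and flag change on both ends at once.
        self._connection.window = value
        self._connection.flag = True

    def went(self):
        # Assumed reading: true once the last value put has been got.
        return not self._connection.flag


class Inlet:
    def __init__(self, connection):
        self._connection = connection

    def came(self):
        # True when a value has arrived and has not yet been got.
        return self._connection.flag

    def get(self):
        # Communication: absorb the value and reset the flag on both ends;
        # the window itself is left unchanged.
        self._connection.flag = False
        return self._connection.window
```

Because detection relies on the flag rather than on comparing window values, putting the same value twice in succession is still seen by the receiver as two arrivals, provided it performs a get in between.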


1.4.4  Machines 

Concurrent  systems  are  designed  as  collections  of  concurrently  active, 
asynchronously  communicating  modules  called  machines  in  CSDL.  These 
machines  are  instances  of  machine  types.

A  machine  is  a  collection  of  data  objects  and  a  sequential  procedure  that 
manipulates  those  objects.  The  sequential  procedure,  Controller,  may
invoke  subprocedures.  A  machine  may  contain  objects  that  are  themselves 
machines;  in  that  case,  the  submachines  operate  concurrently  with  each 
other  and  with  their  parent,  and  each  manipulates  a  disjoint  subset  of  the 
parent  machine's  data  objects.  Each  machine  accomplishes  a  task.  When  a 
machine  contains  submachines,  that  task  is  accomplished  by  the  parent 
machine  and  the  collection  of  submachines.

Like  abstract  data  types,  machine  types  have  a  definition  and  a
realization.  A  machine  definition  documents  a  machine  type's  externally 
visible  data  objects  and  behavior.  A  machine  realization  gives  the 
internal  details  that  implement  that  external  view.  The  machine  type  is 
more  limited  than  data  types  since  there  are  no  explicit  operations 
defined  on  objects  of  type  machine.  Machine  instances  are  created  and 
destroyed  in  controlled  ways. 


1.4.4.1  Machine  Definitions

A  machine  definition  consists  of  a  list  of  the  machine's  public  objects 
and  specifications  of  the  machine's  externally  visible  behavior. 

Public  objects  are  those  (active  and  passive)  machine  objects  which  define 
the  external  view  of  the  machine.  A  machine's  realization  is  guaranteed 
to  have  these  objects.  A  machine  communicates  with  its  environment 
through  its  active  public  objects.  Its  passive  public  objects  are  visible 
to  the  environment,  but  cannot  be  manipulated  by  it.  Public  objects  are 
used  in  specifications  of  the  machine's  externally  visible  behavior. 

Machine  specifications  may  specify  initial  values,  invariant  properties
and  machine  behavior. 

An  INITIAL  assertion  specifies  allowable  initial  values  of  machine  objects 
for  every  machine  instance  of  the  type;  an  implementation  must  guarantee 
that  an  instance  will  satisfy  the  assertion  when  it  is  created.  An 
INVARIANT  assertion  specifies  a  property  of  the  machine's  objects  which  is 
satisfied  when  the  instance  is  created  and  which  is  preserved  at  each 
state  transition  the  machine  undergoes.  Procedure  boundaries  inside  a 
machine  are  transparent  with  respect  to  a  machine  invariant;  each 
statement  in  every  procedure  preserves  the  invariant.  The  invariant  may 
be  violated  inside  a  type  operation,  but  type  operations  are  atomic  from  a 
machine's  point  of  view,  so  the  invariant  is  still  preserved  from  the 
machine's  point  of  view.  BEHAVIOR  assertions  specify  requirements  and 
constraints  on  the  machine's  function  and  performance.  These  are 
atemporal  and  temporal  assertions.  Temporal  assertions  are  explained  in 
section 4.3.3.
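The relation between INITIAL and INVARIANT assertions can be illustrated
with a small executable sketch. It is not CSDL and not part of the design
language; the names Machine, Counter, step and increment are hypothetical
and chosen only for this illustration. The sketch checks the INITIAL
assertion once at creation and the INVARIANT at creation and after every
state transition, whereas CSDL intends such assertions to be verified
against the design rather than checked while running.

```python
class Machine:
    """Hypothetical base class: INITIAL is checked once, INVARIANT after each transition."""

    def __init__(self, **objects):
        self.objects = dict(objects)
        if not self.initial(self.objects):
            raise AssertionError("INITIAL assertion violated at creation")
        if not self.invariant(self.objects):
            raise AssertionError("INVARIANT violated at creation")

    def initial(self, objs):
        """Override per machine type; the default allows any initial values."""
        return True

    def invariant(self, objs):
        """Override per machine type; the default is trivially preserved."""
        return True

    def step(self, transition):
        """Apply one state transition to a copy, then re-check the invariant."""
        new_objects = transition(dict(self.objects))
        if not self.invariant(new_objects):
            raise AssertionError("INVARIANT violated by transition")
        self.objects = new_objects


class Counter(Machine):
    LIMIT = 3

    def initial(self, objs):
        return objs["count"] == 0

    def invariant(self, objs):
        return 0 <= objs["count"] <= self.LIMIT


def increment(objs):
    """A transition: returns the successor state of the machine's objects."""
    objs["count"] += 1
    return objs
```

Creating Counter(count=0) succeeds, Counter(count=1) is rejected by the
INITIAL check, and a fourth call to step(increment) is rejected by the
INVARIANT check while the machine's state stays unchanged.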


1.4.4.2 Machine Realizations

A  machine  realization  opens  the  black  box  machine  definition.  It  is  a 
package  containing  the  concrete  decisions  about  how  to  implement  a 
machine's  observable  behavior.  A  realization  must  contain: 

o the machine's public objects,

o the machine's controller, a distinguished procedure which is never
invoked but starts executing when the machine is created. One typical
controller structure is a prologue block which initializes the machine
objects to some required state, followed by a loop for repeated
scanning of inlets. In this loop, data arrival is responded to by
invoking other procedures and sending data out through selected
outlets. If this controller ever terminates, there can be no further
response to data sent to it from other machines; also passive objects
in the machine can no longer change state.

A  realization  may  also  contain: 

o private data types and objects,

o  machine  and  machine  pool  objects, 

o  specifications  about  internally  visible  behavior  and  performance,  and 
about  the  relations  between  public  and  internally  visible  objects, 

o subprocedures.
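The typical controller structure described above (a prologue followed by
a loop that scans inlets) can be sketched in Python. This is an
illustration under stated assumptions, not CSDL: the machine is modelled
as a thread, inlets and outlets are modelled as standard library queues,
and the names Echo, respond and STOP are hypothetical. STOP exists only so
that the demonstration can end; a CSDL controller of this shape would
normally not terminate.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel, used only to end the demonstration


class Echo(threading.Thread):
    """A machine whose controller starts running when the machine is created."""

    def __init__(self, inlet, outlet):
        super().__init__(daemon=True)
        self.inlet = inlet      # public active object: data arrives here
        self.outlet = outlet    # public active object: data leaves here
        self.handled = 0        # passive object: visible, changed only by the controller
        self.start()            # the controller is never invoked by a client

    def run(self):
        # Prologue: bring the machine's objects to the required initial state.
        self.handled = 0
        # Main loop: scan the inlet and respond to each arrival.
        while True:
            item = self.inlet.get()
            if item is STOP:
                break           # after termination no further input is answered
            self.outlet.put(self.respond(item))
            self.handled += 1

    def respond(self, item):
        """A subprocedure invoked by the controller."""
        return item.upper()
```

Putting "ping" on the inlet of a new Echo machine yields "PING" on its
outlet. Once the controller has terminated, further data placed on the
inlet is never answered and the passive object handled no longer changes.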


1.4.4.3 Dynamism

CSDL  is  intended  for  designing  systems  with  inherent  concurrency  (for 
example,  geographically  distributed  systems),  systems  in  which  concurrency 
is  needed  to  deliver  adequate  performance,  or  for  which  expressing  the 
design as a collection of concurrent modules leads to a simpler, more
understandable  design. 

There  are  two  basic  concurrent  architectures:  the  static  architecture  in 
which  the  system  is  created  with  a  fixed  number  of  modules  which  persist 
throughout  its  lifetime,  and  the  dynamic  architecture  in  which  modules  are 
created  as  needed  to  handle  new  tasks.  CSDL  supports  them  both. 

In  CSDL,  the  basic  locus  of  control  is  the  machine.  The  machine  is  a 
container  of  objects  and  a  control  procedure  in  execution.  A  machine  may 
contain  data  objects  of  any  type.  A  machine  may  also  contain 
machine-objects,  that  is,  other  machines  in  operation,  and  pools  of 
machine-objects  from  which  operating  machines  may  be  created  and 
destroyed.  These  structures  (the  machine-object  and  the  pool  of 
machine-objects)  enable  a  single  machine  to  contain  several  concurrently 
operating  local  loci  of  control. 


1.4.4.3.1 Machine Creation

A  CSDL  system  is  a  machine;  a  concurrent  system  is  a  machine  which 
contains  other  machines.  A  system's,  that  is,  a  top  level  machine's, 
initial  architecture  comprises  a  collection  of  machines,  each  declared  as 
an  object  in  the  object  space  of  the  root  machine,  SYSTEM.  Each  of  these 
machines may contain machines, and so forth. When the system is
instantiated, all machines in SYSTEM's object space are instantiated,
communication links are forged, and they are set in operation. If these
machines  contain  machine  declarations,  the  same  scenario  is  repeated. 
This  is  static  machine  creation.  Machines  created  in  this  way  cannot  be 
destroyed  explicitly;  they  cease  to  exist  if  the  machine  containing  their 
declaration  is  destroyed. 

When  a  machine  containing  pool  declarations  is  instantiated,  empty  pools 
are  created.  During  the  machine's  lifetime,  its  procedures  may  explicitly 
create  and  destroy  pool  elements.  This  is  dynamic  machine  creation. 
Explicitly  created  machines  are  linked  to  their  containing  context  as 
specified  in  an  argument  of  the  create  operation.  If  the  machine 
containing  the  pool  objects  ceases  to  exist,  its  explicitly  created  child 
machines cease to exist because the pools that hold them no longer exist.

Machines  created  statically  or  dynamically  are  wholly  contained  within 
their  creating  (parent)  machine. 


1.4.4.3.2 The Need for Pools

Pool  structures  are  variable  size  collections  of  objects  of  some  single 
machine  type.  (CSDL  allows  pools  of  machines  only.)  The  collections  are 
indexed  by  pool-unique  names. 

There are two operations on pools: "create," which adds an object of the
machine type to the pool, and "destroy," which removes the object named by
its index from the pool. It is also possible to select, or refer to, a
particular element of the pool, and to ascertain the size of the pool,
that is, the number of operating machines currently in the pool.

CSDL  provides  dynamic  machine  creation  and  destruction  to  meet  the  real 
world  requirement  for  dynamic  process  creation  and  destruction.  CSDL  puts 
dynamically  created  machines  in  pools  to  meet  its  goals  of  facilitating 
reasoning  about  designs  and  design  verification.  A  pool's  size  attribute 
permits  the  specification  of  resource  constraints  ("This  pool  contains  no 
more  than  20  machines"),  reasoning  about  pool  size  during  design,  and 
verification  that  a  description  satisfies  specified  bounds  on  resources. 
Of  course,  a  design  can  contain  pools  whose  bounds  are  not  specified. 
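A pool and its operations can be sketched as follows. This is a minimal
Python illustration, not CSDL syntax: the class names Pool, PoolFull and
Worker, and the bound parameter, are assumptions made for the example. In
CSDL a bound such as "no more than 20 machines" is a specification to be
verified against the design; here it is simply enforced when create is
called.

```python
class PoolFull(Exception):
    """Raised when a create would exceed the pool's specified bound."""


class Pool:
    """Variable-size collection of machines of one type, indexed by pool-unique names."""

    def __init__(self, machine_type, bound=None):
        self.machine_type = machine_type
        self.bound = bound          # None models a pool whose bound is unspecified
        self._members = {}

    def create(self, name, *args, **kwargs):
        """Add a new machine of the pool's type under a pool-unique name."""
        if name in self._members:
            raise KeyError(f"name {name!r} is already in use in this pool")
        if self.bound is not None and len(self._members) >= self.bound:
            raise PoolFull(f"pool is limited to {self.bound} machines")
        self._members[name] = self.machine_type(*args, **kwargs)
        return self._members[name]

    def destroy(self, name):
        """Remove the machine named by its index from the pool."""
        del self._members[name]

    def select(self, name):
        """Refer to a particular element of the pool."""
        return self._members[name]

    @property
    def size(self):
        """Number of machines currently in the pool."""
        return len(self._members)


class Worker:
    """Hypothetical machine type used to populate the pool."""

    def __init__(self, job):
        self.job = job
```

With Pool(Worker, bound=2), two create calls succeed and a third raises
PoolFull; after a destroy the size attribute drops back to 1.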


1.4.4.3.3 The Role of Public Objects

Every  public  object  is  part  of  a  machine's  externally  visible  state,  but 
public  objects  serve  different  roles  in  a  machine  design.  Public  objects 
may,  in  addition  to  showing  the  external  machine  state: 

- realize partitioning, when an object in the parent machine's object
space is manipulated by a child machine in order to accomplish part
of the parent's specified task;

- serve as communication ports, if they are active objects and if they
are linked to complementary objects in their environment.


Public  objects  which  only  show  visible  state  are  never  linked  with  the 
environment.  Public  objects  which  realize  partitions  are  bound  to  objects 
in  their  environment.  Public  objects  which  serve  as  communication  ports 
are  connected  to  objects  in  their  environment. 

A  machine's  public  objects  are  not  intrinsically  bindable  only, 
connectable  only  or  unlinkable.  The  same  public  object  may  be  bound  in 
one  instance  of  a  machine  type,  connected  in  a  second,  and  unlinked  in  a 
third.  The  disposition  of  each  public  object  is  determined  at  machine 
creation  by  the  initializing  "linking  specification"  that  appears  in  a 
machine  object  declaration  or  as  a  parameter  to  a  create  operation. 
Public  objects  that  are  not  mentioned  are  unlinked  at  machine  creation; 
they  may  remain  unlinked  for  the  machine's  lifetime  or  may  later  be 
connected  (but  not  bound)  to  complementary  public  objects  in  some  newly 
created  machine. 


1.4.4.3.4 Communication

Dynamic  system  restructuring  is  complete  only  when  a  newly  created  machine 
is  tied  in  to  the  rest  of  the  system.  This  section  discusses  mechanisms 
for  accomplishing  that  linking  and  presents  the  information  flow  issues 
involved. 

Because  CSDL  is  intended  for  designing  operating  system  type  applications, 
it  must  be  able  to  express  a  range  of  communication  options  from  paired, 
blocking  send/reply  through  third  party  reply  to  non-blocking  send  and 
receive.  One  language  facility  that  gives  CSDL  the  flexibility  to  express 
many  communication  mechanisms  is  that  it  has  two  linking  modes:  binding 
and connection. (1)

A  machine  influences  its  environment  by  means  of  public  objects  that  are 
linked  to  the  environment  by  binding  or  connection.  Unlinked  public 
objects  cannot  influence  the  environment.  A  machine  whose  public  objects 
are  all  unlinked  is  effectless. 


(1) An equally important factor in accommodating a range of communication
options is that CSDL's communication objects and primitives do not support
a particular communication mechanism as, for example, Argus [LISK81]
supports send/reply and CSP [HOAR78] supports rendezvous. CSDL's
communication objects are inlets and outlets which can serve as
communication ports and from which more complex abstract active objects
can be built. CSDL's plug and socket communication model is neutral with
respect to the kinds of communication and synchronization mechanisms the
application under design contains. The application's designer builds the
required communication protocols and synchronization mechanisms from
these simple, neutral facilities.


1.4.4.3.4.1 Linking

Communication  links  may  be  forged  only  between  newly  created  machines  and 
machines that already exist. There are two varieties of linking:

1.  Binding  -  in  which  a  parent  machine  identifies  one  of  its  actual 
objects  with  a  child  machine's  public  object,  giving  up  access  to  that 
object  for  the  duration  of  the  child's  lifetime.  Both  active  and 
passive  objects  may  be  bound;  binding  occurs  between  objects  of  the 
same type. Binding is similar to parameter passing by reference, with
the  child's  public  object  acting  as  the  formal  parameter  and  the 
parent's  object  as  the  actual  parameter.  After  binding,  an  object 
which  belonged  to  the  parent  belongs  to  its  child.  Only  one  object 
exists  in  the  system;  control  over  it  shifts  from  parent  to  child  and 
it  is  a  semantic  error  for  a  parent  to  access  or  modify  a  bound  object 
during  its  child's  lifetime.  That  actual  object's  value  is  unchanged. 
Hence,  the  result  of  binding  is  that  the  child's  public  object  is 
initialized  to  be  the  value  of  the  object  to  which  it  is  bound. 

There  can  be  only  one  binding  between  a  child's  public  object  and  a 
parent's  object  for  the  created  machine's  entire  lifetime.  Bindings 
are  broken  only  when  a  machine  is  destroyed. 

2. Connection - in which public objects in child machines are
"actualized" and the parent machine declares information flow paths
among  them  or  between  some  of  them  and  its  own  objects.  Only  active 
objects  are  connected,  and  connection  occurs  between  complementary 
objects:  an  inlet  is  connected  to  an  outlet  and  vice  versa.  Unlike 
binding,  two  objects  are  needed  to  forge  a  connection. 

All  binding  is  done  at  machine  creation.  Connection  takes  place  only  in 
the  context  of  creation.  Connection  is  done  between  a  newly  created 
machine  and  an  already  existing  sibling  or  between  a  newly  created  machine 
and  its  parent.  Hence,  whenever  a  connection  is  made  between  two 
machines,  at  least  one  of  them  is  in  the  process  of  being  created;  it  is 
impossible  to  connect  two  machines  if  both  of  them  already  exist. 
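The difference between the two varieties of linking can be sketched in
Python. This is only an illustration under stated assumptions: Machine,
Inlet, Outlet, bind, connect, and the creating flag are hypothetical names,
and errors that CSDL treats as semantic errors in a design are raised here
as run-time exceptions. Binding shares one object and shifts control of it
to the child; connection joins two complementary objects and is refused
when both machines already exist.

```python
class Inlet:
    def __init__(self):
        self.peer = None
        self.buffer = []

    def get(self):
        return self.buffer.pop(0)


class Outlet:
    def __init__(self):
        self.peer = None

    def put(self, value):
        self.peer.buffer.append(value)


class Machine:
    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.lent = set()        # names of objects bound to a living child
        self.creating = True     # linking is legal only while this is True

    def access(self, obj_name):
        if obj_name in self.lent:
            raise RuntimeError(f"{obj_name} is bound to a child machine")
        return self.objects[obj_name]


def bind(parent, parent_obj, child, child_obj):
    """Binding: one object exists; control over it shifts from parent to child."""
    if not child.creating:
        raise RuntimeError("binding is done only at machine creation")
    child.objects[child_obj] = parent.objects[parent_obj]   # same object, not a copy
    parent.lent.add(parent_obj)


def connect(a, a_obj, b, b_obj):
    """Connection: two complementary active objects joined by a flow path."""
    if not (a.creating or b.creating):
        raise RuntimeError("cannot connect two machines that both already exist")
    x, y = a.objects[a_obj], b.objects[b_obj]
    if {type(x), type(y)} != {Inlet, Outlet}:
        raise TypeError("connection needs an inlet and an outlet")
    x.peer, y.peer = y, x
```

In a typical use, an existing parent binds one of its outlets to a child
under creation and connects another outlet to the child's inlet. After
that, the parent's access to the bound outlet raises an error, while
values put on the connected outlet arrive at the child's inlet.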


1.4.4.3.4.2 Information Flow When Forging Communication Links


Connecting  a  complementary  set  of  active  objects  establishes  information 
flow  paths  among  their  containing  machines.  The  semantics  of  connecting 
complements  is  that  their  complementary  parts  effectively  merge  into  one, 
so that parts of like types will have the same values for the lifetime of
the  connection.  Since  this  identity  is  established  at  the  moment  the 
connection  is  made,  there  will  be  a  one-time  flow  of  information  (usually 
garbage)  into  some  of  the  connected  machines  as  their  objects  undergo 
apparently  spontaneous  state  changes.  To  avoid  injecting  garbage  values 
into  the  state  space  of  a  machine  in  operation,  we  say  that  the  active 
object  in  the  machine  being  created  is  "assigned"  the  value  of  the 
complementary object in the machine being connected to it. Since every
communication link is forged between a newly created machine and an
already existing one, this convention ensures that values do not flow into
a machine already in operation. It is always possible to identify which
member of an information flow path belongs to a newly created machine.
This semantics for connection ensures that we can reason about a machine's
behavior and properties in isolation.

A  machine  may  have  an  INIT  or  INVARIANT  assertion  which  must  be  satisfied 
upon  machine  creation.  When  connection  forging  injects  values  into  a 
newly  created  machine,  those  values  must  satisfy  the  machine's  INIT  and/or 
INVARIANT  assertion. 

The  problem  of  garbage  information  does  not  arise  in  binding  because 
control  over  the  same  object  is  transferred  from  parent  to  child  and  this 
transfer  does  not  change  the  object's  value.  But  information  flow  does 
occur,  since  the  child's  public  object  and  the  parent's  object  become  one. 
The  parent  must  guarantee  that  its  object,  when  bound  to  a  child's  public 
object,  will  satisfy  the  child's  INIT  and/or  INVARIANT  specification,  if 
any. In practice, it is safest either to bind to a public object which
is  not  mentioned  in  an  assertion,  or  bind  several  objects  to  a  set  of 
public  objects  that  participate  in  an  INIT  relation. 


1.4.4.4 Temporal Specifications

Temporal  specifications  are  needed  as  soon  as  the  notion  of  several 
machines  operating  concurrently  is  introduced.  When  two  or  more  processes 
progress concurrently and interact, we must be able to say things about
that  progress  and  those  interactions.  Atemporal  specifications  of  the 
functional  relation  between  a  machine's  inputs  and  outputs  are  not 
sufficient  to  talk  about  computational  progress  in  the  face  of 
interactions.  We  need  to  specify  phenomena  like  termination, 
synchronization,  and  scheduling.  Those  phenomena  can  be  specified  only  by 
pointing  at  changes  in  data  configurations  in  the  time  dimension,  in  other 
words,  by  characterizing  a  system's  computational  history. 

Temporal  assertions  may  express  ordering  relations  (A  precedes  B)  with  no 
metric  time  attached,  timing  relations  (A  precedes  B  by  two  units  of 
time),  metric  properties  of  states  and  transitions  (this  transition  takes 
three  units  of  time),  and  properties  of  data  objects  at  particular  points 
in  a  system  history  (x=0  after  this  transition). 

Like the atemporal language, CSDL's temporal language also specifies
behavior  in  terms  of  relationships  among  values  of  data  objects.  The 
essential  difference  between  atemporal  and  temporal  specifications  is  that 
temporal  specifications  are  concerned  both  with  values  and  with  the  order 
in  which  values  and  value  changes  arise  in  the  system  history.  The 
temporal  language  is  based  on  a  temporal  logic;  its  semantics  are  defined 
in  terms  of  a  set-theoretic  model  of  computation  and  a  model  of  time.  The 
model  of  computation  is  based  on  primitive  notions  of  data  value  and  data 
object. The model of time is based on the real line.

We are currently experimenting with a temporal specification language
based  on  [HALP83,  MOSZ83].  This  is  documented  fully  in  [FRAN83b,  Chapters 
10-11]. 

Every  temporal  proposition  is  an  assertion  about  histories.  Temporal 
specifications  assert  facts  about  the  whole  system  history  by 
characterizing subsequences of that history and the order in which those
subsequences occur. The basic subsequences from which assertions are
composed  are  named  by  propositions  from  the  atemporal  language. 

Some  temporal  propositions,  called  state  propositions,  characterize 
sequences  that  begin  with  a  particular  state.  That  is,  they  characterize 
a  system  as  being  in  a  certain  state.  Others,  called  action  propositions, 
characterize  sequences  that  begin  and  end  with  states  that  stand  in  some 
specified  relation.  They  characterize  a  system  transition.  Temporal 
specifications  that  specify  temporal  partial  orderings  on  the  state 
sequences  in  a  system  history  are  built  with  state  propositions,  action 
propositions,  and  composites  formed  using  conventional  logical 
connectives.  They  specify  properties  of  the  entire  system  history  by 
specifying  the  history's  structure,  a  partial  order  of  the  states  in  the 
history.  Given  a  particular  present,  we  may  specify  both  the  future  and 
the  past  of  a  computation's  history.  A  specification  about  the  future  is: 
"If  a  message  is  sent  by  module  A,  it  is  eventually  received  by  module  B." 
A  specification  about  the  past  is:  "If  module  B  receives  a  message,  it 
was  sent  either  by  module  A  or  by  module  C."  We  may  also  specify 
properties  of  the  entire  history;  one  such  is:  "There  is  never  more  than 
one  token  on  the  communications  bus."  Several  temporal  assertions  may 
specify  different  structures  for  the  system  history;  a  correct  system 
design must realize all the desired structures.

CSDL uses six temporal operators, <I>, <T>, <A>, [I], [T], and [A]. In
order to explain their semantics, we introduce the following sequence
notation: Let (R) and (S) be temporal propositions characterizing
sequences, and let s = s0, ..., sn be a sequence. Then, informally:

(R;S) is true of s if and only if there is at least one state si in s such
that R is true of the subsequence s0, ..., si and S is true of the
subsequence si, ..., sn, 0 <= i <= n. The semicolon is the basic structure
operator; it allows the expression of sequences in terms of ordered
subsequences. Left and right parentheses delimit sequences specified by
their structure.

<I> (S) (read: sometimes initially S) is true of s if and only if S is
true of some initial subsequence s0, ..., si of s. We can express
<I> (S) as

<I> (S) = (S;TRUE)

where TRUE characterizes all non-empty sequences and is used to build
"don't care" subsequences.

<T> (S) (read: sometimes terminally S) is true of s if and only if S is
true of some terminal subsequence si, ..., sn of s. We can express
<T> (S) as

<T> (S) = (TRUE;S)

<A> (S) (read: sometimes somewhere S) is true of s if and only if S is
true of some subsequence si, ..., sj of s. We can express <A> (S) as

<A> (S) = (TRUE;S;TRUE)

[I] (S) (read: always initially S) is true of s if and only if S is true
of all initial subsequences s0, ..., si of s. We can express [I] (S) as

[I] (S) = NOT <I> (NOT S)

[T] (S) (read: always terminally S) is true of s if and only if S is true
of all terminal subsequences si, ..., sn of s. We can express [T] (S) as

[T] (S) = NOT <T> (NOT S)

[A] (S) (read: always somewhere S) is true of s if and only if S is true
of all subsequences si, ..., sj of s, 0 <= i, j <= n. We can express
[A] (S) as

[A]  (S)  =  NOT  <A>  (NOT  S) 

[] is analogous, in the temporal context, to the universal quantifier ∀.
[] is a universal temporal operator; it asserts that every terminal
subsequence, initial subsequence, or subsequence of the sequence under
discussion has some property. <> is analogous, in the temporal context,
to the existential quantifier ∃. <> is an existential temporal operator
that  asserts  that  there  is  at  least  one  terminal  subsequence,  initial 
subsequence,  or  subsequence  of  the  sequence  under  discussion  that  has  the 
property  specified. 
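The informal semantics of the six operators can be tried out on finite
sequences with a short Python sketch. It is an illustration, not the
formal semantics documented in [FRAN83b]: a temporal proposition is
modelled as a Python predicate on a non-empty list of states, the function
names are hypothetical, and the split state of (R;S) is assumed to range
over every state of the sequence, endpoints included.

```python
def chop(r, s):
    """(R;S): R holds on s0..si and S holds on si..sn for some state si."""
    return lambda seq: any(r(seq[:i + 1]) and s(seq[i:]) for i in range(len(seq)))


def TRUE(seq):
    """Characterizes every non-empty sequence ("don't care")."""
    return len(seq) > 0


def NOT(p):
    return lambda seq: not p(seq)


def sometimes_initially(p):      # <I> (S) = (S;TRUE)
    return chop(p, TRUE)


def sometimes_terminally(p):     # <T> (S) = (TRUE;S)
    return chop(TRUE, p)


def sometimes_somewhere(p):      # <A> (S) = (TRUE;S;TRUE)
    return chop(TRUE, chop(p, TRUE))


def always_initially(p):         # [I] (S) = NOT <I> (NOT S)
    return NOT(sometimes_initially(NOT(p)))


def always_terminally(p):        # [T] (S) = NOT <T> (NOT S)
    return NOT(sometimes_terminally(NOT(p)))


def always_somewhere(p):         # [A] (S) = NOT <A> (NOT S)
    return NOT(sometimes_somewhere(NOT(p)))


def starts_with(value):
    """A state proposition: the sequence begins with the given state."""
    return lambda seq: seq[0] == value


def ends_with(value):
    """Holds of sequences whose last state is the given state."""
    return lambda seq: seq[-1] == value
```

For the history [0, 1, 2, 1], sometimes_somewhere(starts_with(2)) holds
because some subsequence begins with state 2, always_initially(
starts_with(0)) holds because every initial subsequence begins with s0,
and always_somewhere(starts_with(0)) fails because the subsequence
consisting of state 1 alone is a counterexample.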


1.4.5  Documentation  Format 

A  CSDL  design  document  is  simply  a  collection  of  all  the  type  definitions, 
type  refiners,  machine  definitions  and  machine  refiners,  arranged  in  any 
reasonable way. We recommend the following "loose-leaf-notebook" style of
documentation  format:  machine  definitions  appear  in  a  flat  machine 
dictionary;  realizations  appear  separately  from  definitions.  Type 
definitions  appear  in  a  flat  type  dictionary;  refiners  appear  separately 
from  definitions.  By  flat,  we  mean  there  is  no  nesting  that  scopes  names. 

One  of  the  machines  in  the  machine  dictionary  and  the  companion 
realization dictionary must be the distinguished machine, SYSTEM.

A machine realization may contain type definitions for types used only in
that  machine.  A  type  refiner  might  contain  private  type  definitions  but 
not  nested  refiners. 

All  machine  definitions  and  most  type  definitions  are  visible  system-wide. 
This  means  a  type  or  machine  may  use  any  type  or  machine  definition  in  the 
type  or  machine  dictionary.  Some  type  definitions  may  be  contained  within 
a  type  refiner  or  machine  realization;  these  types  are  available  only  to 
their  containing  structures. 

The  design  document  may  contain  several  of  each  dictionary,  and  within 
each  dictionary  every  entry  is  a  separate  item  from  every  other.  There  is 
also no importance to the order in which items appear in the loose-leaf
notebook.  So  documentation  standards  in  different  organizations  can  be 
accommodated  by  putting  pieces  of  the  system  design  together  according  to 
each  organization's  documentation  standards. 

The  loose-leaf  style  puts  the  right  pieces  of  documentation  in  the  right 
hands.  For  example,  a  machine's  implementor  will  use  the  machine's 
definition  to  produce  its  realization  but  will  use  only  the  type 
definitions  of  the  types  that  machine  contains.  A  machine's  client,  on 
the  other  hand,  will  use  only  that  machine's  definition  and  the 
definitions  of  its  public  objects.  Obviously  there  must  be  tool  support 
for  combining  and  recombining  text  fragments  into  proper  configurations 
for  different  users. 

From  the  project  management  standpoint,  the  loose-leaf  notebook  is 
produced  a  piece  at  a  time,  so  there  are  clear,  limited,  and  fairly 
autonomous  tasks  to  be  managed.  A  tool  which  manages  the  design  text  can 
also  collect  project  management  data  about  changes,  number  of  accesses, 
versions,  and  so  forth. 


1.5  EXAMPLE 

This is an example of a machine type, Manager, which accepts requests from
its  environment  and  returns  responses.  Figure  1-1  shows  Manager's 
definition.  It  has  two  public  objects.  Information  enters  Manager  from 
the  environment  through  "in,"  an  inlet  of  type  Request.  Information  flows 
from  Manager  to  the  environment  through  "out,"  an  outlet  of  type  Response. 
We  assume  that  Request  and  Response  are  defined. 

Manager's  public  behavior  is  specified  in  terms  of  its  visible  objects. 
The  first  clause  of  the  behavior  specification  asserts  that  any  terminal 
subinterval  of  the  system  history  that  starts  with  the  i-th  arrives 
transition  at  "in"  contains  the  i-th  leaves  transition  at  "out,"  where  i 
may  be  any  positive  integer.  In  other  words,  the  future  of  each  request 
arrival  contains  the  corresponding  reply  transmission.  The  second  clause 
asserts  that  the  i-th  reply  put  to  "out"  is  a  proper  response  to  the  i-th 
request gotten from "in," for all positive i. Here the LET facility is
used to introduce temporary logical variables, rep and req, that are used
in  specifying  the  desired  properties. 

Figure  1-2  shows  an  architecture  that  will  implement  Manager's  behavior. 
It  consists  of  the  two  visible  objects,  in  and  out,  a  machine  of  type 
Handler, three private objects and one procedure, Controller. Controller
accepts  data  from  the  environment  through  in  and  enqueues  it  in  holding. 
It  also  dispatches  jobs  to  Handler  when  Handler  signals  that  it  is  ready 
to  accept  a  new  job.  Manager's  three  internal  objects  are  a  queue,  an 
outlet  for  sending  jobs  to  Handler  and  a  Signal_in  for  accepting  Handler's 
signals. 

The  queue  is  an  abstract  data  type;  Figure  1-3  shows  its  definition.  The 
queue's  type  operation  specifications  are  the  usual  kind  of  data 
characterization specifications. The INIT specification says that an
object of type queue is empty at instantiation. The INVARIANT bounds the
queue's  potential  length. 

Figure  1-4  shows  the  queue  type's  refiner.  Although  the  refiner  design  is 
not  needed  for  designing  Manager,  we  include  it  here  to  demonstrate  the 
use  of  mapping  functions.  Mapping  functions  allow  mechanical 
transformation  of  a  type  definition's  specifications,  which  are  stated  in 
terms  of  the  type's  model,  into  specifications  stated  in  terms  of  the 
type's  implementing  data  structure.  A  type's  implementation  can  be 
verified  against  these  transformed  specifications.  The  refiner's  INIT 
specification,  the  first  two  clauses  of  the  INVARIANT,  and  all  the 
procedure specifications are direct translations of assertions that appear
in  the  type  definition.  The  remaining  six  conjuncts  of  the  data  INVARIANT 
are  needed  once  the  choice  of  implementing  data  structures  is  made.  These 
six  clauses  are  invisible  to  users  of  the  type;  they  concern  only  the 
representing  data  structure. 
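The role of a mapping function can be illustrated with a small Python
sketch of a circular buffer that represents a bounded queue. The field
names circle, front and rear and the n+1 slot layout follow the refiner in
Figure 1-4; everything else is an assumption made for the illustration:
the abstract queue is modelled as a Python list, the method abstract
plays the part of the mapping, and the pre- and postconditions are checked
with run-time assertions, whereas CSDL intends them to be verified against
the transformed specifications.

```python
class CircularBuffer:
    """Bounded queue of at most n elements held in n + 1 array slots."""

    def __init__(self, n):
        self.n = n
        self.circle = [None] * (n + 1)   # one spare slot separates full from empty
        self.front = 0                   # next free slot (enqueue end)
        self.rear = 0                    # oldest element (dequeue end)

    def abstract(self):
        """Mapping: the abstract queue this buffer represents, oldest element first."""
        size = (self.front - self.rear) % (self.n + 1)
        return [self.circle[(self.rear + k) % (self.n + 1)] for k in range(size)]

    def enqueue(self, t):
        before = self.abstract()
        assert len(before) < self.n                  # PRE: the queue is not full
        self.circle[self.front] = t
        self.front = (self.front + 1) % (self.n + 1)
        assert self.abstract() == before + [t]       # POST, seen through the mapping

    def dequeue(self):
        before = self.abstract()
        assert len(before) > 0                       # PRE: the queue is not empty
        t = self.circle[self.rear]
        self.rear = (self.rear + 1) % (self.n + 1)
        assert self.abstract() == before[1:]         # POST, seen through the mapping
        assert t == before[0]
        return t
```

Users of the queue reason only about the list returned by abstract; the
positions of front and rear, including wrap-around, stay invisible to
them.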

Signal_in's  type  definition  is  shown  in  Figure  1-5.  Its  complements 
specification  defines  an  active  type  that  is  Signal_in's  complement  (see 
Section 1.4.3.2.1). Its INIT specification says that the initial value of
an  object  of  type  Signal_in  is  FALSE.  Its  one  operation,  ready,  returns 
the  value  of  the  signal  (TRUE  or  FALSE)  and  leaves  the  signal  FALSE.  This 
type  is  designed  in  tandem  with  designing  Manager's  controller  (Figure 
1-7).  In  particular,  ready  is  designed  as  a  non-blocking  operation 
because  it  is  to  be  used  inside  a  blocking  repetition  construct. 
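The behavior described for Signal_in can be sketched in Python as follows.
The class name SignalIn, the lock, and the method raise_signal (standing
in for whatever the complementary Signal_out does) are assumptions made for
this illustration; only the initial value FALSE and the read-and-reset
behavior of ready come from the description above.

```python
import threading


class SignalIn:
    """A signal whose ready operation never blocks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._raised = False             # INIT: the signal is FALSE

    def raise_signal(self):
        """Hypothetical stand-in for the effect of the complementary Signal_out."""
        with self._lock:
            self._raised = True

    def ready(self):
        """Return the current value of the signal and leave the signal FALSE."""
        with self._lock:
            value, self._raised = self._raised, False
            return value
```

Because ready returns immediately in both cases, it can appear as a guard
inside a repetition that blocks on other objects without introducing a
second place where the controller may wait.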

At  the  level  of  designing  Manager's  realization,  we  are  interested  only  in 
Handler's  definition,  which  is  shown  in  Figure  1-6.  Like  Manager's 
definition, Handler's consists of some public object declarations and a
behavior  specification.  The  specification's  first  clause  asserts  a 
liveness  property  of  the  Handler,  that  one  reply  is  eventually  put  out  for 
each  request  that  arrives.  The  second  clause  is  an  assertion  about  the 
past  rather  than  the  future.  It  says  that,  for  any  positive  i,  the 
initial  subinterval  of  the  system  history  which  ends  with  the  i-th  request 
arrival  contains  a  terminal  subinterval  that  begins  with  the  departure  of 
the  i-th  signal.  In  other  words,  the  i-th  request  must  have  the  i-th 
signal  in  its  past.  The  public  objects  are  an  inlet,  an  outlet,  and  a 
Signal_out, which is Signal_in's complement. At this level of refinement,
we do not need to know even Signal_out's specification. This is an
example  of  the  extent  of  separation  of  concerns  that  CSDL  allows. 
Signal_out  will  be  specified  in  the  context  of  designing  Handler,  and 
implemented  when  convenient. 

Figure 1-7 shows Manager's realization. Manager contains four local
objects.  The  object  server  is  an  instance  of  a  Handler  machine.  It 
constitutes  an  autonomous  control  site  running  concurrently  with  Manager. 
Server comes into existence when an instance of Manager does. The linking
specification following its declaration indicates how server's public
objects are
linked  to  its  environment.  Linking  is  always  done  in  the  context  of 
machine  creation.  Server's  outlet,  done,  is  bound  to  Manager's  outlet, 
out.  Binding  outlet  to  outlet  means  that  for  server's  lifetime  it 
controls  one  of  its  parent's  objects;  this  allows  server  to  return  replies 
directly  to  its  parent's  environment.  Job  and  want  are  local  active 
objects  through  which  an  instance  of  Manager  communicates  with  Handler. 
Job  is  connected  to  server's  inlet;  want  is  connected  to  server's 
Signal_out.  Holder  is  a  Queue  object  that  Manager  uses  to  hold  pending 
Requests. 

Manager's  public  specifications  are  identical  to  the  ones  in  the  machine 
type  definition.  Its  internal  specification  asserts  that  the  requests 
submitted  to  the  handler  are  just  those  that  had  been  previously  received 
from  the  environment. 

Manager's  controller  is  a  (non-terminating)  blocking  repetition  statement 
which  waits  on  two  active  objects.  When  information  arrives  at  the  inlet, 
in, (that is, when in.came=TRUE), and the queue, holder, is not full, the
Controller  gets  the  arriving  data  from  in  and  enqueues  it.  When 
want.ready=TRUE  and  the  queue  is  not  empty,  the  controller  gets  a  job  from 
the  queue  and  passes  it  to  server,  the  Handler  instance,  by  putting  it 
into  the  Request  outlet,  job,  that  is  connected  to  server's  Request  inlet, 
next.  This  controller  has  such  a  compact  design  because  most  of  the 
design  work  needed  to  attain  this  functionality  was  invested  in  designing 
the  data  types  inlet,  outlet.  Queue,  and  Signal_in. 
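
The two guarded actions of the controller can be sketched in Python. This is only an illustration, not CSDL semantics: the class name, the `step` method, and the use of `deque` objects to stand in for the inlet and the queue are assumptions, and the guards are tried in a fixed order, whereas the CSDL repetition statement may choose among enabled guards differently.

```python
from collections import deque


class Manager:
    """Hypothetical rendering of the Manager controller's two guarded actions."""

    def __init__(self, capacity):
        self.capacity = capacity   # bound n of the holder queue
        self.inlet = deque()       # stands in for the inlet `in`
        self.holder = deque()      # pending requests
        self.want_ready = False    # stands in for want.ready
        self.job = []              # requests passed on to the Handler

    def step(self):
        """Run one enabled guarded action; return False if none is enabled."""
        # Guard 1: data has arrived and the queue is not full.
        if self.inlet and len(self.holder) < self.capacity:
            self.holder.append(self.inlet.popleft())
            return True
        # Guard 2: the Handler wants work and the queue is not empty.
        if self.want_ready and self.holder:
            self.want_ready = False          # reading the signal resets it
            self.job.append(self.holder.popleft())
            return True
        return False
```

Calling `step` repeatedly until it returns False corresponds to the controller blocking until one of the two active objects changes state.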
Manager IS

PUBLIC in: Request INLET;
       out: Reply OUTLET

BEHAVIOR

∀ i: Posint(i)
    [T]( <i> arrives(in, in')
        => <T> <i> leaves(out, out') )

AND

∀ i: Posint(i) [
    LET req, rep: Request(req) & <i> in.get' = req
                  & Reply(rep) & <i> out.put(rep)
    [ response(req, rep) ]
]

END {Manager}

Figure 1-1: Manager Machine Definition


Thing_queue(Thing: TYPE, n: INTEGER) IS
MODEL Thing ARRAY

LET tq: Thing_queue(n)

INVARIANT 0 <= tq.dom AND tq.dom <= n
INIT tq.dom = 0

OFUN empty
    PRE TRUE
    POST tq'.dom = 0

OFUN enqueue(t: Thing)
    PRE tq.dom < n
    POST tq'.hib = (tq.hib + 1) AND tq'.high = t

OFUN dequeue RETURNS Thing
    PRE tq.dom > 0
    POST tq'.lob = (tq.lob + 1) AND dequeue' = tq.low

VFUN is_full RETURNS BOOLEAN
    PRE TRUE
    POST is_full' = true IFF tq.dom = n

VFUN is_empty RETURNS BOOLEAN
    PRE TRUE
    POST is_empty' = true IFF tq.dom = 0
END {Thing_queue};

Figure 1-3: Queue Type Definition
Q_as_buffer REFINES Thing_queue(Thing, n)

LET tq: Thing_queue
TYPES
    Circular_Buffer IS [circle: Thing ARRAY; front, rear: INTEGER]
OBJECTS
    buffer: Circular_Buffer
MAPPING
    buffer.rear REPRESENTS tq.lob;
    (buffer.front-1) MOD (n+1) REPRESENTS tq.hib;

    IF buffer.rear <= buffer.front
    THEN ∀ i: INTEGER(i)
        [IF buffer.rear <= i AND i <= buffer.front-1
         THEN buffer.circle(i) REPRESENTS
              tq((i-buffer.rear) MOD (n+1) + tq.lob)];

    IF buffer.rear > buffer.front
    THEN ∀ i: INTEGER(i)
        [IF [0 <= i AND i < buffer.front]
            OR [buffer.rear <= i AND i <= n]
         THEN buffer.circle(i) REPRESENTS
              tq((i-buffer.rear) MOD (n+1) + tq.lob)];

    (buffer.front-buffer.rear) MOD (n+1) REPRESENTS tq.dom

INVARIANT
    0 <= (buffer.front-buffer.rear) MOD (n+1) &
    (buffer.front-buffer.rear) MOD (n+1) <= n &
    buffer.circle.lob = 0 & buffer.circle.hib = n &
    buffer.front >= 0 & buffer.front <= n &
    buffer.rear >= 0 & buffer.rear <= n

INIT
    (buffer.front-buffer.rear) MOD (n+1) = 0

{Procedures to implement type operations}

empty
    PRE TRUE
    POST buffer'.rear = buffer'.front

enqueue(t: Thing)
    PRE (buffer.front-buffer.rear) MOD (n+1) < n
    POST buffer'.front = (buffer.front + 1) MOD (n+1)
         AND buffer'.circle(buffer'.front) = t

dequeue RETURNS Thing
    PRE (buffer.front-buffer.rear) MOD (n+1) > 0
    POST buffer'.rear = (buffer.rear + 1) MOD (n+1) &
         dequeue' = buffer.circle(buffer.rear)

is_full RETURNS BOOLEAN
    PRE TRUE
    POST is_full' = [(buffer.front-buffer.rear) MOD (n+1) = n]

is_empty RETURNS BOOLEAN
    PRE TRUE
    POST is_empty' = [(buffer.front-buffer.rear) MOD (n+1) = 0]

Figure 1-4: Queue Type Refiner
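
The refinement in Figure 1-4 can be illustrated by a short Python sketch of a circular buffer with n+1 slots, in which (front - rear) MOD (n+1) plays the role of tq.dom. The class and method names are assumptions made for this illustration. The sketch stores a new element in the slot indexed by the old value of front, which is one reading consistent with the mapping of (front-1) MOD (n+1) to tq.hib; it is not a transcription of the enqueue postcondition in the figure.

```python
class CircularQueue:
    """Bounded queue of capacity n held in a circular buffer of n+1 slots."""

    def __init__(self, n):
        self.n = n
        self.circle = [None] * (n + 1)
        self.front = 0   # next free slot
        self.rear = 0    # slot of the oldest element

    def size(self):
        # Python's % always yields a non-negative result for a positive modulus.
        return (self.front - self.rear) % (self.n + 1)

    def is_full(self):
        return self.size() == self.n

    def is_empty(self):
        return self.size() == 0

    def enqueue(self, t):
        if self.is_full():                 # PRE: size < n
            raise OverflowError("queue is full")
        self.circle[self.front] = t
        self.front = (self.front + 1) % (self.n + 1)

    def dequeue(self):
        if self.is_empty():                # PRE: size > 0
            raise IndexError("queue is empty")
        t = self.circle[self.rear]
        self.rear = (self.rear + 1) % (self.n + 1)
        return t
```

Keeping one slot unused is what lets the full state (size n) and the empty state (size 0) be told apart using only front and rear.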
Signal_in IS
MODEL Boolean

LET si: Signal_in; so: Signal_out
COMPLEMENTS si, so
INIT si=FALSE

OFUN ready RETURNS Boolean
    PRE TRUE
    POST si'=FALSE & ready = si
END {Signal_in}

Figure 1-5: Signal_in Data Type Definition


Handler IS

PUBLIC next: Request INLET;
       done: Reply OUTLET;
       need: Signal_out

BEHAVIOR

∀ i: Posint(i)
    [T]( <i> arrives(next, next')
        => <T> <i> leaves(done, done') )
AND

∀ i: Posint(i)
    [T]( <i> arrives(next, next')
        => <T> <i> signals(need, need') )

END {Handler}

Figure 1-6: Handler Machine Definition

Manager

PUBLIC in: Request INLET;
       out: Reply OUTLET

BEHAVIOR {externally visible}

∀ i: Posint(i)
    [T]( <i> arrives(in, in')
        => <T> <i> leaves(out, out') )

AND

∀ i: Posint(i) [
    LET req, rep: Request(req) & <i> in.get' = req
                  & Reply(rep) & <i> out.put(rep)
    [ response(req, rep) ]
]

OBJECTS

server: Handler := { done := out;  {binding OUTLET to OUTLET}
                     next TO job;  {connecting INLET to OUTLET}
                     need TO want  {connecting Signal_out to Signal_in}
                   };

job: Request OUTLET;
want: Signal_in;
holder: Request Queue

BEHAVIOR {internal}

∀ i: Posint(i) [
    [T]( <i> job.put => <T> <i> arrives(in, in') )
    AND
    <i> in.get' = THE req: Request(req) & <i> job.put(req)
]

CONTROLLER

WHENEVER ~holder.is_full & in.came -> holder.enqueue(in.get)
      [] want.ready & ~holder.is_empty -> job.put(holder.dequeue)
END

END {Manager realization}

Figure 1-7: Manager Machine Realization
1.6 DISCUSSION


The software engineering project's goal is software engineering
technology:  formal  models,  design  techniques,  technical  methods  and 
languages.  Because  formal  models  are  an  engineering  prerequisite  for 
languages  and  technical  methods,  they  have  received  much  attention  and  are 
the  most  mature.  Because  language  makes  methodology  and  technical  methods 
concrete,  CSDL's  language  components  are  also  relatively  mature.  In  the 
context  of  CSDL  we  have  done  almost  no  work  on  design  analysis  techniques 
and  tools,  but  the  existence  of  models  and  formal  languages  means  that  the 
framework  is  prepared.  Formality  also  means  that  the  foundations  for 
support  tools  are  in  place. 

So  far,  the  weakest  precondition  predicate  transformer  technique  has  been 
used  rigorously  by  very  few  designers.  Practitioners  have  been  extremely 
reluctant  to  give  up  their  familiar  informal  design  styles  for  a  formal 
design  method  that  requires  a  large  learning  investment  in  basics  like 
predicate  calculus,  in  the  CSDL  notation  and  in  the  method  itself.  As 
long  as  we  do  not  offer  a  tool  that  generates  weakest  preconditions  from 
postconditions  and  algorithmic  statements,  we  do  not  expect  algorithms  to 
be  rigorously  constructed  in  CSDL.  However,  the  exercise  of  writing  a 
specification  informally  using  the  constructive  technique  and  examining  an 
algorithm  to  convince  oneself  that  it  meets  the  specification  does 
increase  confidence  in  the  design  produced. 

In  reality,  CSDL,  with  its  "formal  purity,"  is  an  investment  in  the 
future. When the need for provably or constructively correct software
becomes  so  great  that  a  large  dollar  investment  in  tools  is  warranted, 
CSDL  will  be  available  as  a  language  whose  constructs  have  proof  rules  and 
semantics  defined  in  terms  of  a  formal  model.  In  the  short  term,  parts  of 
CSDL  can  be  used  along  with  less  formal  notations,  for  example,  English 
language  specification  and  designs  produced  in  CSDL  notation.  This 
produces  better  designs  than  those  created  with  a  less  complete  notation 
and  paves  the  way  for  designers  to  move  into  an  entirely  formal  system. 
There  is  little  specification  support  for  properties  like  performance  and 
reliability.  Formalizations  of  these  properties  that  can  be  related  to 
the  computational  model  are  required  in  order  to  develop  the  linguistic 
mechanisms. 




Appendix  II 


PUBLISHED  PAPERS 


(1) "An Object-Oriented Design Model for Reliable Distributed Systems" in the
Third Symposium on Reliability in Distributed Software and Database
Systems, Clearwater Beach, Florida, October 1983.

(2) "Zeus: An Object-Oriented Distributed Operating System for Reliable
Applications" in the ACM National Conference, San Francisco, October 1984.

(3) "Some Performance Models of Distributed Systems" CMG XV International
Conference on Computer Performance Evaluation, San Francisco, December
1984.
Abstract

This paper describes an object-oriented design model for structuring
reliable distributed systems. A system is viewed as a collection of
objects that are accessed and modified by transactions. Recovery
techniques are incorporated to make transactions atomic in the presence
of component crashes and concurrent operations. Atomicity of
transactions is based on constructing recoverable objects using multiple
versions and commit protocols. These concepts are extended to nested
transactions. The operations on distributed objects are performed as
remote procedure calls. This requires implementation of remote procedure
calls in a reliable fashion. The facilities of reliable nested
transactions and remote procedure calls are used to synthesize
distributed objects that are highly reliable.


1.0 Introduction

The architectural features of distributed systems, such as physical
isolation between system components which tends to reduce correlation
among component failures, and redundancy of resources to support
continued operations in the presence of component losses, offer great
potential for designing reliable systems. This potential has remained
largely unexploited, however, because of the lack of a formal discipline
to integrate the known existing recovery techniques into distributed
systems design. In this paper we present a design model for distributed
systems which facilitates a systematic and well-structured integration
of known recovery techniques into the designs of distributed systems.

In constructing reliable systems, the maintenance of recoverable
consistent states of objects is an important problem for system
recovery. Another problem, which is functionally orthogonal to recovery,
is concurrency control in distributed systems. The solutions to these
two design problems interact closely.

Object-oriented designs offer an attractive approach to constructing
reliable systems by confining errors in the system, by defining
consistent system state to support rollback and restart, and by limiting
propagation of rollback activities in concurrent systems. An
object-oriented approach is comprised of objects accessed or updated by
users through transactions, a sequence of primitive operations on a set
of objects. A transaction is viewed as a unit of error recovery and
synchronization in the system. The key to designing reliable systems is
the atomicity of transactions and a sufficient level of redundancy in
the system to support continued operations in case of loss of objects
(i.e. system components).

This work was supported by RADC Contract No. F30602-82-C-015A

Lampson and Sturgis [LAMP76], and Gray [GRAY79] introduced independently
the concept of commit protocols to implement atomic actions on
distributed objects in the presence of system crashes. The nested
transaction facility is used for performing distributed concurrent
operations. Constructing nested transactions introduces the concept of
"conditional commitment". The commitment of a nested transaction is
dependent on the commitment of the parent transactions. Most discussions
on this topic have benefited from the concept of "sphere of control,"
first introduced by Davies [DAVI78]. A process execution is viewed as a
"sphere of control" within which the process changes the states of the
objects and controls the commitment of these changes. Once committed,
the changes made within a "sphere of control" can never be revoked. The
problems related to nested transactions, and the designs to address
these problems, have been discussed by Shrivastava [SHRI82b], Reed
[REED78] and Moss [MOSS81].

In the proposed model, operations on remote objects are performed as
remote procedure calls, requiring reliable implementation. Discussions
on reliable remote procedure call models have appeared in recent
literature, most notably the discussions by Spector [SPEC82], Nelson
[NELS81], Shrivastava [SHRI82a] and Lampson [LAMP81b].

Creating small protected domains that interact through well-defined
interfaces plays an important role in system recovery by confining error
propagation. Object-oriented designs facilitate a systematic
construction of such small protected domains for error recovery.
Recently, the object-oriented designs have been used by Liskov
[LISK82b], Shrivastava [SHRI81], Svobodova [SVOB81], Reed [REED79] and
several others to structure distributed systems for high reliability.
The scheme proposed by Reed was the first to use multiple-version
facilities to implement atomic actions. This scheme has been used to
implement reliable storage facilities for objects in the SWALLOW
[SVOB81] design. Liskov [LISK82b] has proposed object-oriented
linguistic mechanisms based on the concepts of guardians and actions to
construct reliable distributed systems.


The problems related to rollback of processes in concurrent systems have
been addressed by several researchers [RUSS80] CCIKr9]. One significant
problem is the domino effect arising from unstructured interactions
among activities and causing a cascade of rollback among concurrent
activities. In object-oriented systems, the activities are
transaction-based so that the interactions among activities are
well-structured and disciplined. Again, "sphere of control" helps limit
the rollback activity.

Managing redundancy in the system in the form of replication of objects
or creation of backup objects is important for supporting continued
operations in the event of loss of resources. The major problem in
redundancy management is maintaining consistency among replicated
objects, and having current state information with backup modules to
support reconfiguration. The solutions to this problem keep a majority
or a survivable set of the replicated copies in a consistent state.

In Section 2 we present the abstract design model for reliable
distributed systems. Section 3 reviews briefly various recovery
techniques applicable to distributed systems. Section 4 integrates these
techniques into the design model. At each level of abstraction,
appropriate recovery techniques are described.


2.0  n<  «ti.^  k...~4  — 

A design model, inspired by Lampson's lattice model [LAMP81a] for
constructing reliable distributed systems, is shown in Figure 1. The
objective of reliable distributed system designs is to synthesize secure
and stable distributed objects that survive system crashes and support
high function availability of services. Such objects are constructed
using unreliable resources such as physical storage (disks), physical
processor, and the communication medium.

In this section we describe the design model shown in Figure 1 in a
"bottom-up" fashion. This model is one possible approach to designing
reliable distributed systems. It is particularly suited for an
object-oriented system in which interactions among objects are
transaction-based. We identify the functions of each level in the graph
shown in Figure 1. In Section 4 we describe the application of various
recovery mechanisms to achieve these functions at each level of the
design model.

The physical storage refers to non-volatile disk storage with a non-zero
probability of information loss; for example, a page on a disk may be
corrupted by a head-crash or other malfunction. Such failures can be
characterized by reliability measures such as the mean-time-to-failure
or a reliability function. Another problem with physical storage is the
non-atomicity of write operations on pages; for example, a crash may
occur in the disk system during writing a new value on a page, leaving
the page corrupted because the old value has been destroyed and the new
value has not been written completely.

The stable storage facility, constructed from the physical disk storage,
provides atomic write operations on pages while enhancing the
availability of data by replication.
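
One way such an atomic page write can be built from non-atomic writes is to keep two copies of each page and write them one after the other. The Python sketch below illustrates that idea under stated assumptions: the class and function names are hypothetical, an in-memory list stands in for the two disk copies, a SHA-256 digest stands in for whatever check information a real disk format would use, and at most one copy is assumed to be damaged by any single crash.

```python
import hashlib


def seal(data: bytes):
    """Pair a page value with its digest so that a torn write is detectable."""
    return (data, hashlib.sha256(data).digest())


def is_valid(copy):
    data, digest = copy
    return hashlib.sha256(data).digest() == digest


class StablePage:
    """One logical page kept as two physical copies (hypothetical layout)."""

    def __init__(self, data: bytes = b""):
        self.copies = [seal(data), seal(data)]

    def write(self, data: bytes):
        # The copies are written one after the other, never together,
        # so a crash can damage at most one of them.
        self.copies[0] = seal(data)
        self.copies[1] = seal(data)

    def read(self) -> bytes:
        first, second = self.copies
        return first[0] if is_valid(first) else second[0]

    def recover(self):
        """Run after a crash: make both copies valid and equal again."""
        first, second = self.copies
        if not is_valid(first):
            self.copies[0] = second      # torn first write: old value survives
        elif not is_valid(second) or first != second:
            self.copies[1] = first       # crash after the first write: new value wins
```

After recovery the page holds either the complete old value or the complete new value, which is the atomicity property described above.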

A physical processor loses the control state data on crashes. A restart
operation causes a process to execute from the beginning. A stable
processor facility, on the other hand, supports saving process states on
the stable storage and restarting a process from a previously-saved
process state. Saving process states is called checkpointing.

In our model the system consists of a collection of objects, each of
which is of a well-defined type. Objects of each type are managed by an
object manager. In addition to supporting operations associated with the
type definition, the object manager for a type also creates objects of
that type, or destroys some existing objects of that type. A system-wide
object called type manager facilitates the introduction of new type
definitions in the system. This approach is based on the principles
followed in the design of Hydra [COHE75]. The type manager object in our
model corresponds to the TtrUTm object in Hydra.

The next level of abstraction provides secure and stable objects based
on stable storage, stable processor and unique identifier (UID)
facilities. Stable objects are those that survive system crashes with a
high probability and for which the primitive operations (i.e. the
operations supported by the type definition) are atomic. Secure objects
are protected objects which can be accessed only by authorized users. In
our model, processes are considered as objects supported by a stable
processor facility.

Using the UID facility, every object in the system is given a globally
unique name; this name is never reused in the entire lifetime of the
system. From this unique identifier, the type of the object can be
inferred. The UID also identifies the node where the object was created.
Objects in the system may migrate from one node to another. The UID
facility defines the logical name space in the system. Operations on an
object are invoked by specifying the UID of the object and the operation
name. Because the unique identifier allows determination of the type of
that object, the operation invocation on an object is directed to the
appropriate object manager for that type. Because the operations on the
remote and local objects are invoked in an identical fashion, we find
the remote procedure call paradigm a convenient abstraction.

The UID generation is based on the stable storage and the stable
processor facilities. The UID generation facility is based on a local
clock process or a sequence counter that uses the stable storage to
survive system crashes and to ensure that the same UID is not
regenerated on restart of a node after a crash. The UID for an object
indicates the type of the object and the node where it was created. A
scheme for generating UID in a reliable fashion is described in
CSCHlSJ].

[Figure 1: A Design Model for Reliable Distributed Systems. Lowest-level
labels recoverable from the figure: Physical Storage, Physical
Processor, Communication Medium.]
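
The sequence-counter variant of UID generation can be sketched as follows. This is a minimal illustration, not the cited scheme: the class name, the tuple layout of a UID, and the use of an ordinary file (written to a temporary name and then renamed) as a stand-in for stable storage are all assumptions. The essential point is that the counter is advanced on stable storage before a UID is handed out, so that a restarted node never reissues one.

```python
import os


class UidGenerator:
    """Hypothetical UID source backed by a persistent sequence counter."""

    def __init__(self, node_id, path):
        self.node_id = node_id
        self.path = path                 # stand-in for a stable storage page
        self.next_seq = self._load()

    def _load(self):
        try:
            with open(self.path) as f:
                return int(f.read())
        except FileNotFoundError:
            return 0                     # first start of this node

    def _store(self, value):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)       # switch to the new value in one step

    def new_uid(self, type_name):
        seq = self.next_seq
        self._store(seq + 1)             # persist before the UID becomes visible
        self.next_seq = seq + 1
        # The UID carries the creating node and the object type, as in the text.
        return (self.node_id, type_name, seq)
```

A crash between `_store` and the return wastes one sequence number but never duplicates one, which is the property the text requires.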

The abstraction of recoverable objects provides mechanisms for
recovering the state of an object after having made some changes to it,
or committing a change to the object state. The concept of commitment
forbids any restoration to states before the commitment. Commitment of a
change to an object implies commitment of all changes made to the object
since the last commit operation.


Atomic transactions are implemented using the facilities described above
and some concurrency control mechanisms. A transaction should be atomic
in the presence of concurrent operations and system crashes. Atomicity
of concurrent transactions requires suitable mechanisms for concurrency
control. There are basically two distinct approaches to concurrency
control: locking protocols [ESWA76] and time-stamp based schemes
CBCUSi]. Recoverable objects support schemes to ensure atomicity of
transactions in the presence of system crashes. Transactions in our
model are treated as objects of process type. As in the case of any
other object in the system, a transaction is assigned a UID.


We use the concept of immutable objects to implement stable recoverable
objects. An immutable object is one that is never changed once it is
created, i.e., every change to an object creates a new object. In our
model every change to an object creates a new version of that object;
this version is uniquely identifiable by using the UID of the object and
the version number. These principles have been discussed in detail in
[REED78] and [SVOB81].
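
The version-per-change idea can be sketched as follows. This is an illustration only: the class name, its methods, and the choice of a Python list indexed by version number are assumptions, and the sketch ignores concurrency control and stable storage. A change first produces a tentative version; commitment makes it the current version, and an abort simply discards it.

```python
_NO_UPDATE = object()   # sentinel: no tentative version exists


class VersionedObject:
    """Hypothetical recoverable object built from immutable versions."""

    def __init__(self, uid, initial):
        self.uid = uid
        self.versions = [initial]        # committed versions; index = version number
        self.tentative = _NO_UPDATE

    def current(self):
        return self.versions[-1]

    def update(self, value):
        # The committed versions are never modified in place.
        self.tentative = value

    def commit(self):
        if self.tentative is _NO_UPDATE:
            raise RuntimeError("nothing to commit")
        self.versions.append(self.tentative)
        self.tentative = _NO_UPDATE
        # A version is identified by the object's UID and its version number.
        return (self.uid, len(self.versions) - 1)

    def abort(self):
        self.tentative = _NO_UPDATE      # the uncommitted version is discarded
```

Because earlier versions are retained unchanged, recovery after an abort requires no undo work: the last committed version is still the current one.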


Nested transactions enable constructing higher levels of abstractions by
encapsulating a set of transactions into one larger transaction. Nested
transactions are also useful for introducing parallelism within an
atomic action. The commitment of computations by each of the nested
transactions is dependent on the commitment of the parent transaction.
Concurrency control mechanisms are required to synchronize nested
transactions of the same or different parent transactions.
The remote procedure call mechanism is based on the nested transaction
facility to ensure the atomicity of operations in the presence of
crashes and other concurrent transactions. The remote procedure call
mechanism uses a reliable message facility with high probability of
message delivery. The discussions in [SALT81] and [LISK82a] support
building reliable remote procedure calls using less sophisticated
facilities such as a datagram. These end-to-end arguments [SALT81] point
out the wasteful duplication of functions at different levels.

Secure communication is achieved by encrypting messages and storing
non-encrypted messages in protected buffers.


1  JazUH  ^ 


Conceptually, there are three fundamental strategies incorporated in
every reliable system design. These are error detection, damage
assessment, and error recovery. The following parts of this section
describe the most common techniques for error detection and recovery.


The reliability of a design depends on the stringency of the techniques
for detecting erroneous states in the system that may lead to system
failures.


Some general techniques for error detection C1R0C79] are described
below.

(a) Replication Checks: In such schemes, an activity is replicated and
the results are checked for consistency. An inconsistency indicates a
possible error condition. Errors can be masked by majority voting as in
Triple Modular Redundant systems.

(b) Reversal Checks: This is used to check what the input to the system
should have been. The calculated input and the actual input are compared
for consistency.

(c) Coding Checks: This is the most popular form of error detection.
Redundant information in the form of checksum or parity is associated
with objects to detect erroneous states.

(d) Acceptance Tests/Consistency Checks: At certain well-defined points
in the execution, tests are applied to the objects to assure that the
state at that point conforms to certain specifications. Any
inconsistencies imply an erroneous state. Consistency checks can also be
applied to some mutilated data structures that are reconstructed on
recovery.

(e) Interface Tests: These tests ensure that the interactions among
system components meet certain acceptance criteria. Tests are applied to
the parameters and the results of interface functions to limit
propagation of errors from one component to another through the
interfaces. Confining errors is strongly dependent on the stringency of
the acceptance tests. In distributed systems, interfaces provide
well-defined and controlled means for the propagation of exception
conditions between modules. If the interface function execution
encounters error conditions, an error condition is returned to the
caller through the interface.

(f) Diagnostic Checks: Explicit tests are conducted on system components
for which expected outputs for given test inputs are known. The
components to be tested and the components conducting tests should be
independent. As pointed out in CiaOE79], diagnostic checks are rarely
used as a primary error detection mechanism, but rather as a supplement
to other detection mechanisms.

(g) Interval Timer/Time-Out Mechanisms: In centralized systems, the
interval timer technique for error detection is based on the time-out
concept. Before starting an activity, the program starts an interval
timer set to a certain delay. If the activity is completed before the
timer counts down to zero, the counter is restarted; otherwise, on
counting zero, the interval timer interrupts the process indicating some
possible error condition. In distributed systems, time-out techniques
are also used to detect possible error conditions. A process invoking a
remote operation waits for a specified time-out period to receive the
response. If no response is received within this period, an exception
condition is raised and appropriate forward error recovery is initiated.


3.2


Depending on the way a consistent system state is regenerated, error
recovery techniques are divided into two broad categories: backward
error recovery and forward error recovery. In backward error recovery, a
prior consistent state in the execution history is restored. Forward
error recovery techniques, which are application-dependent, use the
present error state to arrive at some consistent state.

The backward error recovery requires facilities for establishing
recovery points which, after crashes, support reconstructing or
restoring the state at the most recent recovery point prior to the
crash. Some of the techniques used for backward recovery are described
briefly below:

(a) Checkpointing: In this technique, the complete state of the process
to be checkpointed is saved on a stable storage. In a process-oriented
design, a checkpoint saves on the stable storage the current state of
all the objects bound to the process. In effect a checkpoint creates a
record on a stable storage of the complete execution environment of the
process that exists at the time of checkpointing.

(b) Careful Replacement: This technique avoids updating objects "in
place". Updates are made to a "current" copy, and a "shadow" copy
maintains the version before the updates. On commitment, the "shadow"
copy is replaced by the "current" copy.

(c) Multiple Versions: In this technique, updates to objects are
recorded in a new version that becomes current only on the commitment of
the updates. In case the updates are to be undone (i.e. aborted), the
new version, which is uncommitted, is discarded.

(d) Logs/Audit Trail: In this technique, actions performed on an object
are recorded in a log or audit trail. The purpose of the logs is to
support either undo of the logged action for state rollback or redo of
the logged action to ensure permanence of results produced by some
committed transaction. Logs that contain the redo actions are called the
forward logs; logs that record the undo actions are called the backward
logs. The backward logs either record the inverse operations or the
values of the object before the application of the logged action. During
a recovery process, the backward log is used by executing it backwards
for undoing actions in a last-in, first-out fashion. The following
write-ahead-log rule is always followed to ensure recovery: i) force the
undo log on the stable storage before updating an object in-place, ii)
force the redo log on the stable storage before committing an update.

(e) Differential Files: In this technique, all updates to an object are
recorded on a differential file [SEVE76]. The updates from the
differential file are merged periodically into the main copy of the
object and such updates are then deleted from the differential file. The
differential file technique provides an inexpensive means of maintaining
multiple versions of a large object. Intentions lists are a form of
differential files or forward logs containing redo actions that record
the new values of the objects and have the property of idempotency. The
property of idempotency implies that repeated executions (some of which
may be incomplete) of this sequence of actions would always bring the
updated object to the same state.

(f) Primary/Backup Mode of Operation: If an error
is detected during the invocation of some
service supported by the primary object, a
backup object provides a continuation of those
services starting with some previous
consistent state. The backup object may not
be identical to the primary object. The
technique of recovery blocks ChOMT*] is an
example of integrating these concepts into
software architectures. A primary block,
along with one or more alternate blocks and an
acceptance test, forms a recovery block.
First the primary block is executed and the
acceptance test is applied. If the acceptance
test succeeds, the recovery block terminates
successfully; otherwise, the next alternate
block is executed with the state of the system
restored back to the one that existed before
the application of the previous block.
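
A recovery block can be sketched as follows. This is
an illustration under stated assumptions, not the
report's implementation: the state is a dictionary,
the recovery point is a deep copy, and an exception
raised by a block is treated like a failed
acceptance test. The functions primary and alternate
are hypothetical.

```python
import copy


def recovery_block(state, blocks, acceptance_test):
    """Run the primary block, then each alternate, until the test passes."""
    for block in blocks:
        saved = copy.deepcopy(state)   # recovery point before this block
        try:
            block(state)
            if acceptance_test(state):
                return state           # recovery block terminates successfully
        except Exception:
            pass                       # assumption: an error counts as a failed test
        state.clear()
        state.update(saved)            # restore the state that existed before the block
    raise RuntimeError("every block failed the acceptance test")


def primary(s):
    # Hypothetical faulty primary: leaves the data unsorted.
    s["data"] = list(reversed(s["data"]))


def alternate(s):
    s["data"] = sorted(s["data"])


result = recovery_block(
    {"data": [3, 1, 2]},
    [primary, alternate],
    lambda s: s["data"] == sorted(s["data"]),
)
```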

(g) Object Replication: This technique maintains
multiple copies of an object at different
sites to increase its availability. At least
a survivable subset is always kept in the most
up-to-date state. This set is chosen such
that the probability of all members of this
set being in the crashed state is very low.
Such sets are called the survivable sets
[MINO82]. The essence of this principle is
reflected in some of the replication
management schemes that have appeared in the
literature. The simplest is the majority
update rule [THOM79], proposed basically to
address the concurrency control problem. A
generalization of this scheme is the
weighted-voting schemes proposed by Gifford
[GIFF79] and Skeen [SKEE82], where every
replica of an object is assigned some number
of votes. The rules for accessing or updating
the replicated object are based on acquiring
sufficient votes (i.e., forming a quorum) in
the system. All members in the quorum are
updated atomically. By changing the rules for
forming quorums for operations, different
reliability and performance levels can be
attained.
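
Quorum formation under weighted voting can be
sketched as below. The vote assignment and the
thresholds are hypothetical; the two constraints
shown (read plus write quorum exceeding the total,
and the write quorum exceeding half the total) are
the conditions usually stated for Gifford's scheme
and are not spelled out in this report.

```python
def has_quorum(votes, reachable, threshold):
    """True if the reachable replicas together hold at least `threshold` votes."""
    return sum(votes[r] for r in reachable) >= threshold


votes = {"A": 2, "B": 1, "C": 1}   # hypothetical vote assignment per replica
total = sum(votes.values())        # 4 votes in all

write_quorum = 3                   # assumed: 2 * write_quorum > total
read_quorum = 2                    # assumed: read_quorum + write_quorum > total

can_write_ab = has_quorum(votes, {"A", "B"}, write_quorum)
can_write_bc = has_quorum(votes, {"B", "C"}, write_quorum)
can_read_bc = has_quorum(votes, {"B", "C"}, read_quorum)
```

Raising the read quorum and lowering the write
quorum (or the reverse) trades read cost against
write cost, which is the tuning the paragraph above
refers to.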

(h) Self-Identifying Object: In this technique
suitable descriptors are attached to the
objects to facilitate reconstruction of
directories by salvation programs. Salvation
programs are used only in cases of extreme
failures where not enough information is left
in a consistent state to support automatic
rollback and restart. Such programs need
operator intervention.

Generally, every reliable system design
incorporates both forward and backward error
recovery techniques. The most common technique for
forward error recovery is exception handling
[GOOD79]. Exception conditions are the anticipated
error conditions in the system. An exception
handler is a program block that is invoked when a
specified exception condition arises during
run-time. The purpose of the exception handlers is
to bring the system to a consistent state.
Generally the exception handlers are application
specific. Forward error recovery requires a
complete understanding of the application for which
the system is being designed. In this paper we do
not consider these techniques in any more detail.


4.0 Integration of the Reliability Mechanisms in


In this section we describe the reliability
techniques that are suitable at each level of
abstraction in the design model shown in Figure 1.
The discussion is divided into four major parts:
object management, transaction management, remote
procedure calls, and the management of distributed
objects. We focus on the problems related to
recovery rather than protection and security
issues.


4.1

Lampson presented the techniques for constructing
stable storage from an unreliable disk storage
facility [LAMP81a]. The primary goal of his scheme
is to make the operation of writing disk pages
atomic.

Lampson's scheme is based on the technique of
careful replacement. The atomic operation for
writing pages on the non-volatile storage is called
StablePut. The StablePut operation first writes
the page on an unused disk page rather than writing
it over the original; thus, any failure during
execution of the StablePut operation leaves the
original page intact. Periodically the two pages
are compared, and the old page is replaced by the
new one. The pages are also checked for any
corruption of data by applying suitable parity or
checksum tests. The corrupt page is replaced by
the data of the other page if that page is still
unspoiled. This replication of pages also
increases the availability and mean-time-to-failure
for the pages, provided the pages are stored on
different storage units such that their failure is
independent.
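
The careful-replacement idea can be sketched in a
few lines. This is a toy in-memory model, not
Lampson's algorithm as published: the class
StablePage, the fail_at parameter, and the use of
CRC-32 as the checksum test are all assumptions made
for illustration.

```python
import zlib


def seal(data):
    """Pair the page bytes with a checksum."""
    return (data, zlib.crc32(data))


def intact(page):
    data, checksum = page
    return zlib.crc32(data) == checksum


class StablePage:
    def __init__(self, data):
        self.original = seal(data)
        self.spare = seal(data)      # the "unused" page written first

    def stable_put(self, data, fail_at=None):
        if fail_at == "spare":
            # Simulate a torn write: stored checksum does not match stored bytes.
            torn = data[:1]
            self.spare = (torn, zlib.crc32(torn) ^ 1)
            return
        self.spare = seal(data)      # step 1: write the new page elsewhere
        if fail_at == "original":
            return                   # crash before the original is replaced
        self.original = seal(data)   # step 2: replace the old page

    def cleanup(self):
        """Compare the pages; repair a corrupt one or promote the new one."""
        if intact(self.spare):
            self.original = self.spare
        else:
            self.spare = self.original

    def get(self):
        self.cleanup()
        return self.original[0]


page = StablePage(b"v1")
page.stable_put(b"v2", fail_at="spare")
after_torn_write = page.get()        # old contents survive
page.stable_put(b"v2", fail_at="original")
after_late_crash = page.get()        # new contents already durable
```

The sketch ignores the case in which both pages are
corrupt, which the text rules out by requiring
independent storage units.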

4.2 

An object manager supports primitive operations on
the objects of its type, as well as other functions
such as the construction of recoverable objects,
concurrency control, and access control. Objects
for which recovery and synchronization are provided
by the object manager are called
[LISK82b].

Generation of UIDs is an important part of reliable
object management. A crash-resistant scheme for
generating UIDs in the system is described in
CSCHaSB]. In this scheme, every node in a subset
of nodes, which forms a survivable set, must
possess a stable storage facility. A global
sequence counter is replicated over this subset,
and sometimes global synchronization among these
nodes is required. On restart, nodes not having a
stable storage obtain the sequence number from one
of the members of this survivable set of nodes.


Recoverable Objects. Conceptually, constructing
recoverable objects in our design model is based on
the multiple version techniques [REED78], [SVOB81].
Every change to an object creates a new version of
that object; such versions are finalized upon
committing the enclosing transaction. On
transaction aborts, the tentative versions created
by that transaction are discarded. The versions
are forced onto the stable storage to make them
recoverable under node crashes.

Reliability techniques most suitable for
constructing recoverable objects include multiple
versions, differential files, intention lists,
audit trails/logs, and self-identifying objects.
Generally, a combination of several of these
techniques is used in constructing recoverable
objects at a node.

It is less expensive to maintain multiple versions
as a differential file rather than as copies of the
original object. A differential file in which the
sequence of changes is idempotent can be used as an
intention list to ensure the permanence of results
on the commitment of a transaction. Backward logs
are used for restoring objects by undoing the
actions recorded in the log. Whenever a new
uncommitted state of an object is to be forced
in place on the stable storage from the volatile
memory, it is essential that (in order to keep the
object recoverable) the backward log be forced on
the stable storage before forcing the uncommitted
object in place on the stable storage.
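
The ordering constraint on the backward log can be
sketched as follows. The class RecoverableObject and
its fields are hypothetical stand-ins for state held
on stable storage; the only point illustrated is
that the undo record is forced before the in-place
update.

```python
class RecoverableObject:
    def __init__(self, value):
        self.stable_value = value    # in-place copy on stable storage
        self.stable_undo_log = []    # backward log on stable storage

    def force_uncommitted(self, new_value, crash_before_update=False):
        # 1) Force the undo record (the old value) first.
        self.stable_undo_log.append(self.stable_value)
        if crash_before_update:
            return
        # 2) Only then overwrite the object in place.
        self.stable_value = new_value

    def undo_all(self):
        """Scan the log backwards (last-in, first-out) to restore old values."""
        while self.stable_undo_log:
            self.stable_value = self.stable_undo_log.pop()


obj = RecoverableObject(10)
obj.force_uncommitted(20)
obj.force_uncommitted(30, crash_before_update=True)  # log forced, update lost
obj.undo_all()
```

If the two steps were reversed, a crash between them
would leave an uncommitted value in place with no
record of what it replaced.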

Self-identifying objects and consistency checks
play an important role in reconstructing objects,
object headers, and directories during the restart
after a crash. For example, with multiple
versions, differential files, and logs, additional
information such as the object UID, state of the
versions (committed, uncommitted, commit-pending,
etc.), and pointers to other versions, logs, and
differential files is incorporated for crash
recovery. After reconstructing the data
structures on crash recovery, the consistency
checks are important in checking the validity and
correctness of the reconstructed data structures.

At this point we describe a scheme and its
associated data structures for maintaining multiple
versions in the system to construct recoverable
objects. Logically, every version in this scheme
contains a descriptor which contains the UID of the
object, version number, UID of the transaction
currently holding this version, a time-stamp
indicating its creation time, and a status of the
version. The status field can be in any one of the
following states: uncommitted, commit-pending,
committed, and aborted.

The time-stamp field of the versions is useful for
discarding the versions created by a transaction
since its last checkpoint. The commit-pending
state is used during execution of the two-phase
commit protocol [LAMP76], [GRAY79] with the current
user transaction. The commit protocol is initiated
by the user transaction by sending a
prepare-to-commit message to the object managers of
all the objects it has updated. On receiving such
a message, the object managers change the status
field of the current versions to commit-pending,
and return a positive acknowledgement. A version
in the commit-pending state cannot be unilaterally
discarded by its object manager.
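
The version descriptor and the first phase of the
commit protocol can be sketched as below. The field
names, the class ObjectManager, and the method names
prepare and discard_unilaterally are hypothetical;
only the descriptor contents and the rule about
commit-pending versions come from the text.

```python
import time
from dataclasses import dataclass


@dataclass
class Version:
    object_uid: str
    number: int
    holder_uid: str              # UID of the transaction holding this version
    created: float               # time-stamp of creation
    status: str = "uncommitted"  # uncommitted, commit-pending, committed, aborted


class ObjectManager:
    def __init__(self):
        self.versions = []

    def update(self, object_uid, txn_uid):
        v = Version(object_uid, len(self.versions) + 1, txn_uid, time.time())
        self.versions.append(v)
        return v

    def prepare(self, txn_uid):
        """Handle the prepare message: mark the transaction's versions pending."""
        for v in self.versions:
            if v.holder_uid == txn_uid and v.status == "uncommitted":
                v.status = "commit-pending"
        return True              # positive acknowledgement

    def discard_unilaterally(self, version):
        if version.status == "commit-pending":
            return False         # must wait for the protocol's decision
        self.versions.remove(version)
        return True


manager = ObjectManager()
v1 = manager.update("X", "T1")
ack = manager.prepare("T1")
refused = manager.discard_unilaterally(v1)
```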

In the scheme proposed here, we use differential
files to maintain multiple versions. The versions
of an object are maintained in a differential file
as records of changes to the existing committed
copy of that object. Applying these changes to the
object has the property of idempotency; therefore,
the differential files also serve as intention
lists. For every transaction, one such file is
created as shown in Figure 2. The file control
block (FCB) plays an important role in this scheme.
The FCB for a differential file has two parts:
Current Transaction Descriptor and Physical Storage
Map. Current Transaction Descriptor contains the
identifier and the status of the transaction that
has recorded new uncommitted versions of the object
in the differential file. Physical Storage Map
points to the records on the stable storage
containing the updates for the new versions. By
rewriting this FCB using the atomic StablePut
operation the entire FCB can be changed in one
atomic action. This use of FCB for atomic updates
is similar to the scheme described in [LOt^]
[FAXIT?]. To record an action on the file, the
changes are written on new pages, the FCB is
modified and re-written using the StablePut
operation. At this point, the change has been
successfully recorded.


4.3 Process and Transaction Management

In this section, we discuss the use of reliability
techniques to implement reliable processes and
transactions. As noted in Section 3, processes are
considered as objects. Transactions are atomic
processes; transactions are objects of process type
with some additional properties; therefore,
transaction type is a sub-type of process type.

A process object once created can be in one of five
states: Inactive, Running, Suspended, Completed,
or Aborted. The operations for process objects
include: Create, Destroy, Start, Restart, Status,
Suspend, and Resume.

From an object-oriented viewpoint, new versions of
a process object are created during execution of
the process, i.e., a new version of the process
object is created whenever its program counter
changes. This view is consistent with the one for
multiple versions of data objects. There is,
however, a difference between these two types of
objects in handling rollback recovery: with data
objects, it is usually possible to save all
versions of an object before these versions are
committed so that rollback to a previous version is
relatively simple; with process objects, it is too
expensive and impractical to save the process
states of all execution steps. Checkpointing can
be viewed as the selective saving of versions of
process objects and is used to establish recovery
points for process objects.

The following operations for process objects are
used to support checkpointing and rollback:

o Establish_Recovery_Point - saves the current
process state of the process object in
stable storage.

o Discard_Recovery_Point - discards the checkpoints
of a process object.

o Rollback - continues the execution of a process
from a checkpoint.

Note that with the above scheme for checkpointing,
only the state of the process object is saved; the
states of objects modified by that process are not
saved in the checkpoint. This approach may create
problems for error recovery since not all state
changes of the process are recorded. It is
necessary, therefore, to follow some discipline in
using checkpoints and atomic transactions.

First, we require that a non-transaction process
(e.g., a user process) must invoke a transaction in
order to modify an object or a set of objects. The
changes to an object are recorded as new versions
of the object. New versions of the object are
committed to become permanent at the end of a
successful completion of the commit protocol among
the invoked transaction, the invoking process, and
the object managers of the modified objects.
Uncommitted versions are discarded by explicit
abort commands from the transaction process or by
timeout on inactivity.

If a transaction is nonidempotent, i.e., multiple
executions of the transaction produce different
results, a problem may arise in error recovery
since rollback of the process may cause a committed
transaction to be re-executed. One solution to
this problem is to always force the invoking
process to perform a checkpoint before the
transaction completes committing the modified
objects. Checkpointing is part of the commit
protocol; if the protocol determines to abort, the
checkpoint is discarded. With this mandatory
checkpoint, rollback recovery of a process can
avoid undesirable repetition of transaction
execution; however, this may cause too frequent
checkpointing of the invoking process. The second
solution, therefore, is to make checkpoint of the
invoking process an option that is to be specified
at the time of invoking a transaction. This
checkpoint apparently is not required for
idempotent transactions to guarantee correct
execution; a process, however, invoking an
idempotent transaction may elect to force a
checkpoint during the transaction commit protocol
for efficiency reasons. For example, if the
transaction requires extensive computation compared
to checkpointing the invoking process, and if the
possibility of a failure is significant, it may be
desirable to have a checkpoint as described above.
That decision is left to the process that invokes
the transaction.

The following example illustrates the flexibility
provided by the second solution. Consider the
following scenario in which a process receives some
item from a buffer, then processes the item.
GetItem is the transaction that is invoked by the
process to receive an item from the buffer.
Commitment of this transaction leaves the buffer in
a new state in which the removed item is no longer
present and the previous state can never be
restored. If the process invoking this transaction
checkpoints itself on the commitment of the
transaction, the received item is a part of the
checkpointed state of the process. Any subsequent
rollback will restart processing of this saved
item, and there will not be any need to re-invoke
the GetItem transaction. On the other hand, if the
process does not checkpoint on committing the
GetItem transaction, on a subsequent rollback the
old item would be lost; the process would invoke
the GetItem transaction once again; and processing
would be performed on a new item. In certain
applications such as process control systems, the
second scenario may be a valid mode of error
recovery.
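
The two outcomes of the GetItem scenario can be
reproduced with a small simulation. It is a sketch
under simplifying assumptions: the buffer is a
deque, the process state is a dictionary, and
exactly one failure occurs after the transaction
commits. The function name run and its parameter
are hypothetical.

```python
from collections import deque


def run(buffer, checkpoint_on_commit):
    """Return the item that is processed after one failure and rollback."""
    checkpoint = {"item": None}        # process state at the last checkpoint
    item = buffer.popleft()            # GetItem commits: the buffer changes for good
    if checkpoint_on_commit:
        checkpoint = {"item": item}    # the item becomes part of the saved state
    # ... a failure occurs while processing; roll back to the checkpoint ...
    item = checkpoint["item"]
    if item is None:
        item = buffer.popleft()        # old item lost: GetItem is invoked again
    return item


with_checkpoint = run(deque(["a", "b"]), checkpoint_on_commit=True)
without_checkpoint = run(deque(["a", "b"]), checkpoint_on_commit=False)
```

With the checkpoint the process resumes work on the
item it already removed; without it, the first item
is dropped and the next one is processed.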

A question on checkpointing still exists: Because
the checkpoint of a process does not include the
current states of the objects that are modified by
the process, how does it guarantee correct rollback
recovery? This question can be answered by
considering the ways in which objects are affected
by a transaction: first, a transaction may
directly modify an object by invoking an operation
on the object; second, it may invoke another
transaction that modifies the object.

We first consider the object modified by the
transactions by invoking an operation on the
object. One requirement for correctly implementing
rollback is that the object manager for
transactions must maintain for each transaction a
list of UIDs of the objects that are affected by
the transaction so that in the event of failure,
the transaction will be rolled back to its latest
checkpoint. This requires that all changes to
objects made by the transaction after the
checkpoint be discarded. The list of UIDs of the
objects that are affected by the transaction
provides a means for notifying these objects to
discard the unwanted versions. This list is also
used at the end of the transaction to conduct the
commit protocol.

The discussion in the preceding paragraph implies
that for implementing rollback, timestamps must be
recorded with each checkpoint and each version of
objects. This is necessary because checkpoints do
not record all versions of a process object and
thus there is not a one-to-one mapping between
process checkpoints and versions of objects
affected by the process. These techniques allow us
to roll back a process correctly with modifications
to objects in the first way.
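
The use of the per-transaction UID list together
with timestamps can be sketched as follows. The data
layout is hypothetical: each version is a tuple of
(transaction UID, timestamp, value), timestamps are
plain integers, and store maps object UIDs to their
version lists.

```python
def rollback(txn_uid, checkpoint_time, affected, store):
    """Discard versions that txn_uid created after its latest checkpoint."""
    for object_uid in affected.get(txn_uid, []):
        store[object_uid] = [
            (owner, stamp, value)
            for (owner, stamp, value) in store[object_uid]
            if not (owner == txn_uid and stamp > checkpoint_time)
        ]


store = {
    "X": [("T0", 1, "x0"), ("T1", 4, "x1"), ("T1", 9, "x2")],
    "Y": [("T1", 8, "y1")],
}
affected = {"T1": ["X", "Y"]}   # UIDs of objects touched by transaction T1

rollback("T1", checkpoint_time=5, affected=affected, store=store)
```

Only the versions stamped after the checkpoint are
removed; the version T1 created at time 4 survives
because the checkpoint already covers it.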

The correctness of rolling back a process with the
second way of modifications to objects is
guaranteed by the principles followed in committing
a nested transaction. The updates made by a nested
transaction are made permanent only if its parent
transaction is committed. A rollback within a
transaction may cause abortion of some committed
nested transactions. In case a transaction is
aborted, the changes made by the transaction are
discarded (transactions are atomic). The objects
are brought back to the state before the
transaction was started. Failure of the invoking
process poses no problems to these objects.

For idempotent transactions, the case is even
simpler. Because transactions are atomic, changes
to objects are either not done or made permanent
from the invoking process point of view; and
because the transactions are also idempotent, the
rollback recovery is always correct, no matter
where the invoking process is rolled back to.


Nested Transactions - As described in the previous
section, at the end of a successful transaction the
objects that were changed by the transaction are
committed to become permanent; however, for a
nested transaction (a transaction invoked by
another transaction) that completes successfully,
commitment of the changes to objects will be
dependent on the success of the parent transaction.
If a nested transaction is aborted, the changes to
objects made by the transaction will be discarded
regardless of the success of its parent
transaction; however, the failure of a nested
transaction may not always cause its parent
transaction to abort. In this section, we will
describe a technique for implementing nested
transactions. This technique requires only minor
modifications to the technique for implementing
single level transactions.


The technique that we use here requires recoverable
objects as described in Section 4.2, i.e., each
update to an object creates a new version of the
object. Each version carries the information to
indicate whether it is a committed, uncommitted, or
commit-pending version. In order to support nested
transactions, additional information is needed for
each version to indicate on which transaction the
version is dependent. This information is attached
to a version when it is created by a transaction.

At the end of a transaction, the transaction either
commits or aborts the changes to the object. If it
aborts, only the uncommitted versions that are
dependent on this transaction are discarded. If it
commits, all versions of the object that are
dependent on this transaction are changed to become
dependent on the parent transaction of the current
transaction. In this case, if the current
transaction is at the top level, i.e., if it is
invoked by a non-transaction process, these
versions are committed to be permanent.

Figure 4 shows an example of nested transactions.
Transaction T1 updates the object X to create
Versions 1 and 2. Both versions are uncommitted and
contain the information that they are dependent on
T1. A logical view of this result is shown in
Figure 5(a). T1 then invokes transaction T2, which
creates Versions 3 and 4 of X (Figure 5(b)). When
T2 is completed successfully, all versions that are
dependent on T2 are changed to be dependent on T1,
the parent transaction of T2 (Figure 5(c)). Since
T1 is a top-level transaction, i.e., it is invoked
by a non-transaction process, when it is completed,
all versions that are dependent on T1 are committed
to become permanent (Figure 5(d)). Version 4 of X
is now the current committed copy of object X;
other versions can be discarded at this point.
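
The dependency bookkeeping in this example can be
traced with a short sketch. The class VersionStore
and its dictionary layout are hypothetical; the
commit and abort rules follow the description above.

```python
class VersionStore:
    def __init__(self):
        self.versions = []   # each: {"n": int, "depends_on": str, "status": str}

    def create(self, txn):
        self.versions.append(
            {"n": len(self.versions) + 1, "depends_on": txn, "status": "uncommitted"}
        )

    def commit(self, txn, parent=None):
        """Re-parent the versions, or make them permanent at the top level."""
        for v in self.versions:
            if v["depends_on"] == txn and v["status"] == "uncommitted":
                if parent is None:
                    v["status"] = "committed"
                else:
                    v["depends_on"] = parent

    def abort(self, txn):
        """Discard only the uncommitted versions that depend on txn."""
        self.versions = [
            v for v in self.versions
            if not (v["depends_on"] == txn and v["status"] == "uncommitted")
        ]


x = VersionStore()
x.create("T1"); x.create("T1")        # Versions 1 and 2
x.create("T2"); x.create("T2")        # Versions 3 and 4, T2 nested in T1
x.commit("T2", parent="T1")           # all four now depend on T1
owners_after_t2 = [v["depends_on"] for v in x.versions]
x.commit("T1")                        # top level: versions become permanent
current = x.versions[-1]
```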

In order to implement the above scheme for nested
transactions, we can use differential files to
maintain multiple versions as described in Section
4.2 and Figure 2. In the example in Figure 4, for
a nested transaction, a new FCB and descriptor
block is created as shown in Figure 3. When a
nested transaction completes, the transaction UID
field in the descriptor of the version that is
being committed is replaced by the UID of its
parent transaction, and the status field is changed
to the uncommitted state. The status field of a
version changes to committed only when the
transaction committing it is the outermost level
transaction. In Figure 3, transaction T2 is nested
within transaction T1, and T1 created versions X1
and X2 for object X. Transaction T2 appends new
versions X3 and X4 to the differential file, and
these changes are visible only in the FCB that is
being used for transaction T2. On the commitment
of T2, the old FCB is replaced by the new one, and
the user transaction field contains T1. When T1
commits, the status field in the descriptor is
changed to committed, and the updates from the
differential file are applied to the object. If
any crash occurs during this updating, the
procedure can be restarted from the beginning.


4.4 Remote Calls

The problems related to reliable remote procedure
calls have been discussed in [LAMP81b] and
[SHRI82a]. One problem associated with the
implementation of remote procedure calls is their
execution semantics. In case of a retransmitted
request message, should repeated executions be
permitted? To address this problem, Nelson
has classified the semantics of remote
procedure calls as follows:

o "At most once" - In this semantic, at most one
execution of the procedure takes place. It is
possible that no invocation occurs. In this
case the call returns with some error condition.

o "At least once" - This semantic means a
successful return from the call guarantees at
least one execution of the procedure.

In most of the applications "at most once" is
preferred. One problem in the implementation of
"at most once" is detecting duplicate requests at
the server end. If the client process crashes
after sending the call request and retransmits the
request after the restart, the server should be
able to detect the duplicate request. For this
purpose the UID facility is used to assign a unique
name to the request.
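
Duplicate detection by request UID can be sketched
as below. The class Server is hypothetical, Python's
uuid module stands in for the UID facility, and
caching the result of the single execution so that a
retransmission can be answered is an assumption of
this sketch rather than something the report
prescribes.

```python
import uuid


class Server:
    def __init__(self):
        self.results = {}      # request UID -> result of the one execution
        self.executions = 0

    def call(self, request_uid, func, *args):
        if request_uid in self.results:
            # Duplicate request: answer from the record, do not execute again.
            return self.results[request_uid]
        self.executions += 1
        self.results[request_uid] = func(*args)
        return self.results[request_uid]


server = Server()
request_uid = str(uuid.uuid4())   # unique name assigned to the request
first = server.call(request_uid, lambda n: n + 1, 41)
retry = server.call(request_uid, lambda n: n + 1, 41)   # retransmission
```

The retransmission must carry the same UID as the
original request for the lookup to succeed.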

If a requester crashes after the server has started
the procedure execution, the procedure invocation
is termed an "orphan". After a restart from the
crash, the requester process will retransmit the
remote procedure call request. At this point we
have two options in the design. The first option
is to retransmit the request with the same UID as
was used for the initial call request. If the
original request was lost, this retransmitted
request will invoke the remote procedure. If the
server received the original request and started
procedure execution that was later rendered
"orphan" due to the requester crash, the server
would detect the duplicate request, continue the
"orphan" execution which is no more an orphan, and
return the results of the "orphan" to the restarted
requester. This scheme requires that every
requester process must have access to a stable
storage facility to store the request along with
its UID so that on a restart the retransmitted
request has the same UID. Because of this
limitation we reject this scheme and propose the
second scheme in which every remote procedure call
is an atomic action which commits only after
executing a commit protocol with the requester.
Thus, the results produced by the "orphans" are
discarded because the commitment protocol fails.
This scheme eliminates the need for a stable
storage at every node at the expense of decreased
performance due to commitment protocols.

The reliability of the datagram facility can be
enhanced by introducing appropriate reliability
techniques into the network layer and the link
layer supporting this facility. At the network
level, the network topology is an important design
issue. A network topology with higher connectivity
would generally exhibit better reliability
characteristics. At the link level, appropriate
retransmission protocols are used to deal with
transient errors.


4.5 Distributed Objects

The reliability techniques at this level deal with
maintaining redundancy in the system. The
redundancy in the system is maintained in the form
of object replication, primary/backup copies, or
survivable sets of objects. The techniques
suitable for managing redundancy at this level are
based on the principles of voting [THOM79] [GIFF79]
together with some commit protocol [LAMP76]
[GRAY79]. At the distributed application level,
the reliability techniques deal with the synthesis
of reliable objects by redundancy management and
the construction of recoverable transactions.
Atomic transactions play a key role at this level.
At the application level, these fundamental
mechanisms are integrated into some higher level
techniques, such as a recovery block, for system
structuring. Forward error recovery based on
exception handling is an important part of
reliability techniques at this level.

The problems associated with the management of
redundancy in the system have been discussed in
Section 3. The nested transaction facility
provides a convenient and powerful abstraction to
perform atomic operations on a set of distributed
objects. Replication management techniques based
on quorums or majority consensus are used within
nested transaction structures.

The concept of recovery blocks can be used
conveniently at this level to define a primary
transaction along with a set of backup transactions
and some acceptance test. This can be done easily
in our model because transactions are atomic.
Integrating the backward recovery techniques, such
as a recovery block, with forward error recovery
using exception handling can create very effective
recovery mechanisms in a design. Such an
integration of these two concepts has been
described in [MELL77]. For forward error recovery,
exception conditions can be associated with
primitive operations on objects. Exception
handlers can be introduced within a transaction;
this does not affect the atomicity of a
transaction. If a transaction is a part of a
recovery block, an acceptance test is applied on
its completion, but before its commitment. The
transaction is committed only if this test is
passed; otherwise the transaction is aborted and an
alternate transaction is tried.


Conclusion

We have presented an object-oriented design model
that supports structuring of distributed systems
for high reliability and error recovery. In this
model, we have identified the error recovery
problems at the different levels of functional
abstraction and have shown how various error
recovery techniques are integrated into this design
model. For example, techniques based on the
multiple version concept are used for constructing
recoverable objects, checkpointing and commitment
techniques are used for constructing atomic
transactions, and the techniques based on
replication and primary/backup modes of operation
are used for constructing reliable distributed
objects. This model has been used in the design of
Zeus [BROW83], an object-oriented distributed
operating system for high integrity applications.


1.0 INTRODUCTION

This paper presents the principles followed in
designing Zeus, an object-oriented distributed
operating system designed to study integration of
recovery mechanisms into the designs of distributed
command and control systems. The primary goal of
the Zeus design is to define reliable object
management functions for distributed command and
control systems and to evaluate the performance and
the correctness of the recovery mechanisms for
these functions. Therefore, no implementation of
this design currently exists. The user-provided
functions support definition of object types,
creation of objects, and updating of distributed
objects using atomic transactions. We are
currently evaluating the performance
characteristics of this design using simulation
models and proving the correctness of the recovery
mechanisms using formal methods based on Gypsy
language C*EEM3], events and state transition
based models [TRIP83b], and simulation models. To
achieve these goals we have refined the Zeus design
to a significantly detailed level. To date we have
explored this design only from the viewpoint of
these goals. Several research problems necessary
to implement this system remain unexplored. For
example, a linguistic mechanism is needed to
introduce object type definitions into the system
and to define processes and transactions.

A distributed operating system for highly
reliable applications must provide 1) recovery
mechanisms that are transparent to the application
developers and 2) naming mechanisms that make the
physical distribution of objects and functions
transparent to the application programmer. The
second feature is important to make development of
distributed software no more difficult than the
development of conventional software systems. The
Zeus design has made a significant contribution in
this direction. Other systems have integrated
these two concepts in their designs; however, they
typically limit object management to the file
storage level. To date, Argus [LXSOa] is the only
other system which provides a set of general
mechanisms for reliable management of distributed
objects of any type. Zeus provides these
mechanisms and addresses several other issues, such
as object relocation, authentication and object
protection, not included in the Argus design.
Another novel feature in Zeus is the integration of
the conventional database management functions into
the operating system object management functions.
This is important because most of today's
popular operating systems do not provide efficient
mechanisms for database applications [STON81].
Even with respect to its recovery model, the Zeus
design differs significantly from other known
designs.

Much of the recent research in reliable system
design is actually exploration into system
structuring techniques. These are more significant
for distributed systems than conventional
centralized systems because distributed systems are
intrinsically more complex. A structured approach
can reduce design complexity by factoring the
designs into layers that create different levels of
functional abstraction; the design of a layer can
then be carried out somewhat independently of the
design of other layers. The layers in the system
can be viewed as creating horizontal partitions in
the system design.


Another structuring concept, which is dual as
well as orthogonal to layering, is
object-orientation which creates vertical
partitions in the system. The interactions between
partitions occur through well-defined interfaces;
thus, each partition in the system represents an
independent domain where the internal structure of
a domain cannot be directly accessed by other
domains. A vertical partition essentially embodies
the concept of objects in the system. The whole
system is viewed as a collection of objects. All
state transformations in one partition by other
partitions are performed through the interfaces
defined by the partition. The advantage of such an
approach is that the design of the internal
structure of any given partition is independent of
the designs of other partitions. These are the
fundamental principles of data abstraction. From
the viewpoint of reliable system design, such an
approach is very attractive because it supports
confinement of errors within an object boundary.
This also implies that the recovery mechanisms for
a given partition can be designed to suit its
reliability requirements.

The concept of object-oriented design has been
used in some recent distributed system designs such
as Cronus [sau83], SWALLOW [SVOBSt], Argus
[Lisna], and in the approach presented in
[sntlSt]. Argus provides object-oriented
linguistic mechanisms for constructing reliable
distributed systems, and SWALLOW provides reliable
object management. These systems do not support
some of the other operating system functions such
as access control, naming, sharing, and resource
management. Some of the functions supported by
Zeus, such as naming, authentication, and
interprocess communication, exist in other
operating systems such as Pilot [RE0C80] and
Grapevine [aiMfla], developed for network-based
applications. Neither of these two systems is,
however, a general purpose distributed operating
system.


The Cronus operating system design has
significantly influenced the design of Zeus,
largely because both these systems are intended for
highly reliable applications such as command and
control systems. Zeus provides users with reliable
object management, which is not present in the
current design of the Cronus system. Like Cronus,
Zeus has the character of a general purpose
operating system mainly because the nature of the
command and control applications includes a wide
range of processing characteristics. This is in
sharp contrast to the requirements for banking or
airline reservation systems where the application
environment is well-defined. Zeus provides
capabilities for defining and creating objects and
transactions required by the application systems.
It also provides mechanisms that support management
of such objects in a reliable fashion. Zeus can be
used for constructing any high reliability
application system.


This paper presents the basic object-oriented
building block mechanisms provided by the Zeus
distributed operating system. The concept of
object managers is the basis for system
structuring. An object manager provides the
encapsulation for a given type of objects; all
objects of that type are accessed or updated via
that object manager. The object-oriented recovery
model underlying the Zeus design is described in
[TRIP83a]. In this model the construction of
reliable distributed objects is based on an atomic
transaction facility and a remote procedure call
mechanism. This approach is summarized in Figure
1.


DISTRIBUTED OBJECT MANAGEMENT FUNCTIONS
(Partitioned and Replicated Objects)


REMOTE PROCEDURE CALL MECHANISM


ATOMIC TRANSACTION FACILITY


LOCAL OBJECT MANAGEMENT FUNCTIONS
(Concurrency Control, Recovery, Access Control,
Object Storage Management)


KERNEL FUNCTIONS
(Host Resource Management, Communication, Scheduling,
Interrupt Handling)


HARDWARE


A Model for Reliable Distributed Systems
Figure 1

The lowest layer in this figure represents the
kernel functions that execute at every host node of
the distributed system. Above the kernel layer are
the local object management functions such as
storage management, access control,
synchronization, and object recovery. This layer
represents the functions that are associated with
every object manager in the system; the functions
at this level deal only with the centralized object
management. The next layer provides the facility of
atomic transactions; thus, a sequence of operations
can be performed on a set of objects in an atomic
fashion. The remote procedure call mechanism
facilitates operations on objects that are not
local. We have adopted the remote procedure call
mechanism because it provides a uniform way of
accessing remote as well as local objects. Thus the
location of the object is transparent to the users
during access or update operations. It is
important to make the semantics of remote and local
procedure calls identical in the presence of host
crashes and communication link failures. In our
design we have adopted the "at most once" execution
semantics for remote procedure calls; thus, in the
presence of duplicate messages or on server node
crash/restart, effectively only one execution of
the remote procedure will occur. The combination
of the remote procedure call mechanism with the
atomic transaction facility is used for managing
objects that are either partitioned or replicated.
Based on these mechanisms one can suitably create
type definitions for replicated or partitioned
objects such that one can access or update these
objects in the same manner as updating centralized
objects.
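
The effect of "at most once" execution in the presence of
duplicate messages can be illustrated with a small sketch.
It is written in Python purely for illustration and is an
assumption, not the Zeus mechanism: the class name Server, the
(client_id, seq) request key, and the in-memory reply table
are hypothetical. Surviving a server crash/restart would
additionally require that table, or equivalent information,
to be recoverable; the sketch does not model that.

```python
class Server:
    """Suppresses duplicate requests by remembering the reply to each one."""

    def __init__(self):
        self.counter = 0
        self.replies = {}          # (client_id, seq) -> reply already sent

    def handle(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Duplicate message: replay the saved reply, do not re-execute.
            return self.replies[key]
        self.counter += amount     # the actual remote procedure body
        self.replies[key] = self.counter
        return self.counter
```

A client that retransmits request (client_id, seq) after a
lost reply receives the original result, and the procedure
body runs only once for that request.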

The object management model used in the Zeus
design is based on the concepts developed in the
Hydra [COHE75] design. In an object-oriented
approach, the system is comprised of a set of
objects, and each object is of a well-defined type.
A Type Manager object for a given type manages all
objects of that type. All operations on permanent
and shared objects in the system are executed via
their type managers. There are some obvious
differences between the protection models used in
the Hydra and Zeus designs. The protection
mechanism in the Zeus design is based on access
control lists while the Hydra model is capability
based. Although both these models are equivalent
in terms of their functionality, they differ with
respect to their operational environment. The
prime reason for using the access control list
model in our design is to be able to change the
access rights dynamically. It is not very
efficient to change access rights dynamically in a
capability based system, yet dynamic changing of access
rights is important in a command and control system
where some of the nodes might be taken over by
hostile forces.
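
The reason access control lists make dynamic changes cheap
can be seen in a short sketch: the rights are stored with the
object, so revoking them is a single local update rather than
a search for capabilities already handed out. The Python
below is an illustrative assumption only; the class name
AccessControlList and its methods are hypothetical and do not
describe the Zeus protection interface.

```python
class AccessControlList:
    """Per-object list of which principal may invoke which operation."""

    def __init__(self):
        self._rights = {}                       # principal -> set of operations

    def grant(self, principal, operation):
        self._rights.setdefault(principal, set()).add(operation)

    def revoke_all(self, principal):
        # One local update removes every right of a compromised principal.
        self._rights.pop(principal, None)

    def allows(self, principal, operation):
        return operation in self._rights.get(principal, set())
```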

2.0 PRINCIPLES OF DISTRIBUTED OBJECT-ORIENTED
DESIGN IN ZEUS

2.1 Structure of Object-Oriented Systems

An object-oriented system consists of a
collection of Type Managers and the objects created
by them. As described above, the Type Managers
create vertical partitions in the system. For a
given type in the system, a Type Manager would
exist at all those nodes which may be required to
store objects of that type. A Type Manager at a
node manages all objects of that type at that node.
The multiple instances of Type Managers for a type
function cooperatively to provide the abstraction
of a single Type Manager for that type in the
system. Each Type Manager defines an address space
in which all the objects of that type reside. A
Type Manager is logically viewed as a single
process that performs all the state transformations
on the objects in its address space in response to
execution requests by some other objects of the
same or different type.

At a physical node, several different Type
Managers may reside, each managing objects of its
type at that node. The abstract machine to support
such an object-oriented system can be constructed
from almost any hardware/software system
architecture. The system architecture of the
processors to support such a system must have: (i)
a mechanism for switching the processor between
Type Managers, (ii) a mechanism for partitioning
secondary memory resources among Type Managers, and
(iii) a mechanism for exchanging messages between
Type Managers.


It can be seen from the preceding model of
Type Managers that there is no concept of a
system-wide state or uniform control and/or
recovery mechanisms. Resource management functions
and recovery mechanisms are partitioned along with
the set of Type Managers. The traditional
functions of system-wide software units such as
operating systems and database systems are
incorporated into a collection of Type Managers
which implement the basic elements of the model of
distributed computations. This is a radically new
view of operating systems.

Object Type Managers are the primary building
blocks for the permanent elements of the system.
The Type-Type Manager is an object in the system
that manages "types" in the system. It is the
means by which new types are introduced into the
system. The concept of the Type-Type Manager is
essentially the same as that of the TYPE-TYPE
object in the Hydra design [COHE75].

The objects in the system are accessed in a
uniform fashion regardless of their locations. All
operations on permanent objects are performed
within a transaction. A transaction is basically
an atomic action that is defined as a sequence of
operations on local or remote objects. A
transaction ensures atomicity of distributed
operations. It is possible to introduce
concurrency within a transaction by creating nested
transactions.

2.2 Object Naming

The most basic requirement at the lowest level
of the system architecture is to identify and refer
to objects unambiguously. This requires that each
object must be associated with some system-wide
unique identifier (UID). In the design model
adopted for Zeus, a unique identifier is associated
with every object in the system; from this
identifier the "type" of the object can be
inferred. To aid object location the Zeus design
uses the concept of an "extended" UID. An extended
UID adds a "host hint" field to a UID that
identifies the host from which the object was most
recently accessed. Based on the "type" field in
the UID, a reference to an object is directed to
the appropriate Type Manager at the node given by
the "host hint" field of the UID.

2.3 Functions of the Type Managers

The functional characteristics implemented by
the Type Managers are the original basis for
defining abstract data types. Extending abstract
data type concepts to include a formal basis for
the integration of recovery, synchronization, and
access control mechanisms generates a number of
additional functions for the Type Managers:

1. Each Type Manager is directly responsible
for the mapping of the occurrences of the
objects it defines to physical storage.

2. Each Type Manager implements access
control policies for the occurrences of
its type.

3. Each Type Manager supports concurrent
execution of its procedures and/or
functions.

4. Each Type Manager ensures the consistency
of the objects it stores under concurrent
and distributed use.

5. Each Type Manager implements the necessary
levels of redundancy to ensure the level
of fault tolerance given in its
specification.

This obviously integrates many functions that have
been conventionally associated with database
systems into the object management functions of
this operating system.

2.4 Structure of Type Managers

A Type Manager is externally viewed as a
collection of functions and procedures which can be
invoked on the objects of its type by specifying
the identifier of the object along with the
operation name. This causes an invocation request
message to be sent to the Type Manager regardless
of its physical location in the system.
Internally, these operations are executed by the
Type Manager using one or more server processes;
such server processes may be dynamically created or
destroyed by a Type Manager. The operations on
remote and local objects are invoked by the clients
in the same fashion as a procedure call. Such
invocations on remote objects are performed by
implementing remote procedure calls [NELS81]
[SHRI82] with "at most once execution" semantics.
A Type Manager consists of:

- Data structures for the objects of that
  type;

- Procedures/functions defining the type;

- Concurrency protocols;

- Recovery mechanisms;

- A database to manage the objects in its
  domain;

- A controller process that schedules/executes
  the requests.

A Type Manager is responsible for the permanent
storage of the object instances of its type. Each
Type Manager interfaces directly with some set of
permanent storage devices. The Type Manager
generates the mapping from the UID for an object of
its type to the physical storage on some permanent
storage devices. It also realizes object
instantiation in the executable volatile storage
from the permanent storage. There is no
system-wide file system. The object management
system takes the place of a file system.


A Type Manager consists of a controller
process whose purpose is to schedule server
processes to serve client requests. The server
process is given the same UID as that of the client
process. Thus, a client process is conceptually
viewed as migrating into the address space of the
Type Manager. This view of the migrating client
process is useful from the viewpoint of enforcing
access rights associated with the client process.
On the completion of the requested service, the
server process is deallocated. The controller
process accepts the incoming or outgoing invocation
request messages, performs security checks, and
interfaces with the kernel procedures.
Effectively, the controller process plays the part
of a local operating system for the Type Manager;
the scheduling policies can thus be tailored to the
specific requirements of the Type Managers. The
controller process manages the server processes
performing the operations and provides them with a
set of procedures that perform resource management,
communication, protection and other services that
are normally provided by an operating system.


A Type Manager's controller has several
responsibilities related to protecting its objects
from unauthorized access. Upon receiving an
invocation request, the controller must obtain and
store the requesting process' identification. This
information is made available to the operation via
a callable procedure so that the Type Manager's
controller may check the access list of the object.
In addition, the controller appends the
identification of a process which is making an
outgoing invocation request to some other Type
Manager.

When an incoming invocation request is
received, the controller attempts to locate the
object whose UID is given in the request. First,
the controller looks for the object in its own
local pool of objects. If found, the program which
will perform the operation on the object is
parameterized with the object's local address and
then is scheduled as the server process. If the
object is not found locally, the controller
determines if a "forwarding address" has been left
for that object. This might occur if the object
has been relocated to some other host. If the
object is not found locally, the controller sends a
reply message indicating that the object was not
found and includes the forwarding address if any.
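
The lookup performed by the controller can be sketched as
follows. The Python below is a minimal illustration under
assumed names: the class Controller, its two tables, and the
shape of the reply tuple are hypothetical and are not taken
from the Zeus design.

```python
class Controller:
    """Locates an object locally or reports where it was moved."""

    def __init__(self):
        self.local_objects = {}    # UID -> local address
        self.forwarding = {}       # UID -> host the object was relocated to

    def locate(self, uid):
        if uid in self.local_objects:
            # Found: the operation can be run with this local address.
            return ("found", self.local_objects[uid])
        # Not found: reply with the forwarding address, if one was left.
        return ("not_found", self.forwarding.get(uid))

    def relocate(self, uid, new_host):
        del self.local_objects[uid]
        self.forwarding[uid] = new_host
```

The caller can use a returned forwarding address to retry
the request at the new host.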

In response to an update request, the Type
Manager creates a new version of the object. This
version is committed only when the transaction that
created it commits; the uncommitted versions are
discarded if the transaction aborts.

Each Type Manager maintains a database which
records the necessary information pertaining to the
objects in its address space. This database
records the identifiers of the objects of that type
currently present at that node, their physical
addresses, and the commitment status of their most
current versions. A Type Manager is also
responsible for aborting an uncommitted version if
it detects no activity by the transaction that
created this version. Every time a new version of
an object is created by a transaction by invoking
an update operation, the Type Manager ensures that
this new version is written onto the stable storage
before sending an acknowledgement for the
operation. A scheme for maintaining such multiple
versions using differential files is described in
[TRIP83a].
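
The life cycle of an object version can be sketched in a few
lines. This Python fragment is an illustrative assumption and
not the differential-file scheme cited above: the class name
VersionedObject and the per-transaction table of pending
versions are hypothetical, and the write to stable storage is
represented only by a comment.

```python
class VersionedObject:
    """Keeps one committed version plus uncommitted versions per transaction."""

    def __init__(self, value):
        self.committed = value
        self.pending = {}              # transaction id -> uncommitted version

    def update(self, txn, value):
        self.pending[txn] = value
        # A real Type Manager would force the new version to stable
        # storage here, before acknowledging the operation.
        return "ack"

    def commit(self, txn):
        # The version becomes current only when its transaction commits.
        self.committed = self.pending.pop(txn)

    def abort(self, txn):
        # Uncommitted versions are simply discarded.
        self.pending.pop(txn, None)

    def read(self):
        return self.committed
```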

Type Managers are responsible for ensuring
that each of their defined operations is atomic.
The operation must either complete successfully or
else abort, leaving the object completely
unmodified. This is not difficult to achieve if
only local objects are being modified in the
operation. However, if the operation involves
invoking operations on other Type Managers, then
the controller uses the transaction facility to
ensure the atomicity of the update. If the Type
Manager is structured so that operations may be
executed concurrently, the controller ensures that
objects are not being modified by two operations
simultaneously or read by one operation while being
modified by another. Each type, in general, has
its own set of constraints on the allowed order of
execution of its operations on a given object.
These constraints are supplied when the Type
Manager is created.

2.5 Distributed Types


The reason for introducing the concept of
distributed types is to make transparent the
distributed nature of an object that is logically
viewed as a single object. The components of an
object may be distributed by replication or
partitioning. The transparency of the replicated
or partitioned nature of an object is a convenient
abstraction which makes updating and accessing of
distributed and centralized objects identical.


A distributed type is an abstract data type
whose concrete representation is distributed. For
example, an abstract type called reliable-file
might be implemented using physically distributed
replicated copies of a file, or a global database
might be implemented as a set of partitioned
distributed components. The consistency and
coordination among the distributed components of
the concrete representation is specified in the type
definition and enforced by the distributed Type
Manager. Unlike the centralized objects, an
occurrence of a distributed type does not have a
unique host location, i.e., an object of a
distributed type may "reside" at more than one host
for reliability and performance reasons. An
occurrence of a distributed type is given a UID; the
Type Manager then maps the operations directed to
this UID into a set of operations, which are
executed as a transaction, on the components that
comprise the distributed object's concrete
representation. This mapping can be done at any of
the hosts where the distributed object is
conceptually "residing". The operations defined
for a distributed type are implemented as
transactions.
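
As an illustration of mapping one operation on a distributed
object into a transaction over its components, the following
Python sketch models the reliable-file example with
replicated copies. It is a sketch under stated assumptions,
not the Zeus implementation: the class names Replica and
ReliableFile, the prepare/commit/abort methods, and the
write-all policy are hypothetical choices made for the
example.

```python
class Replica:
    """One physical copy of the file, held at some host."""

    def __init__(self):
        self.data = ""
        self.tentative = None
        self.up = True                 # False simulates an unreachable host

    def prepare(self, data):
        if not self.up:
            raise ConnectionError("replica unreachable")
        self.tentative = data          # uncommitted version

    def commit(self):
        self.data, self.tentative = self.tentative, None

    def abort(self):
        self.tentative = None


class ReliableFile:
    """Single logical file whose concrete representation is replicated."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, data):
        # One logical write becomes a set of component operations that
        # either all take effect or all are undone.
        try:
            for replica in self.replicas:
                replica.prepare(data)
        except ConnectionError:
            for replica in self.replicas:
                replica.abort()
            return False
        for replica in self.replicas:
            replica.commit()
        return True

    def read(self):
        return self.replicas[0].data
```

The client sees only write and read on one object; whether
the copies are replicated or partitioned is hidden behind the
type's operations.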


3.0 STRUCTURE OF THE ZEUS SYSTEM


Zeus is essentially a collection of Type
Managers (TMs); typically, many different Type
Managers coexist on a host node. The core of the
operating system consists of a set of Type Managers
that support capabilities for defining new types
and object instances in the system, authentication
of users, naming environment for each user, and
reliable process and transaction management
functions. These system-defined Type Managers
reside at every node in the system.


The lowest level of operating system at each
node is called the Kernel; the kernel virtualizes
the resources at the host so that each Type Manager
can be viewed as having its own virtual processor.
The Kernel supports interprocess communication,
primary storage management, processor scheduling,
interfaces to secondary storage devices, and UID
generation. As shown in Figure 2, all Type
Managers at a node execute over the abstract
machine interface provided by the kernel. The
kernel multiplexes the processor between the Type
Managers; it also handles all interrupts due to
storage devices and the communication devices.


3.1 Structure of the Zeus Kernel

The kernel consists of a task dispatcher and a
number of interrupt handlers. The task dispatcher
schedules the different Type Managers at its host
node and handles their requests for resources. It
also handles the restart of the system and
initiation of the Type Managers. The resources
managed by the kernel include volatile and
non-volatile storage, the processor and the
communication handler. The kernel interface
consists primarily of three parts: invocation
requests to other Type Managers, requests for
unique numbers, and requests for resources.
Storage management in the kernel is minimal.
Storage is available in fixed sized blocks and the
Type Managers request one or more of these blocks
at any time. A Type Manager is solely responsible
for the data it writes to the blocks of storage.
The kernel keeps track of the ownership of blocks
of storage. The routing of invocation requests to
Type Managers is the major function of the kernel.
Each call is an operation invoked against an object
that is held by some Type Manager. The Operation
Switch, which is a component of the kernel,
supports this function.


3.1.1  The  Operation  Switch 

The function of the Operation Switch is to
forward an invocation request to the appropriate
Type Manager at the local or a remote node. These
calls may be from a Type Manager or from the
network driver. Each call contains the following
information:

1. The extended UID of the object against
which the call is invoked.

2. The extended UID of the process invoking
the operation.

3. The extended UID of the principal on whose
behalf the operation is being invoked.

4. The operation and a set of parameters.

The Operation Switch uses the host hint field
of the target object's extended UID to determine
whether the object is on the host or not. If it
is, it uses the type unique number of the object to
direct the call to the proper Type Manager. If the
object is on another host, the Operation Switch
instructs the Network Handler to send the call to
the other host.
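
The routing decision made by the Operation Switch can be
sketched as follows. The Python below is a minimal
illustration under assumed names and is not the Zeus kernel
interface: the ExtendedUID field names, the dictionary used to
represent a call, and the callables standing in for Type
Managers and the Network Handler are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtendedUID:
    type_id: str       # unique number of the object's type
    instance: int      # unique number of the object itself
    host_hint: str     # host from which the object was most recently accessed


class OperationSwitch:
    def __init__(self, host, type_managers, network_handler):
        self.host = host
        self.type_managers = type_managers        # type_id -> callable(call)
        self.network_handler = network_handler    # callable(host, call)

    def forward(self, call):
        target = call["object"]
        if target.host_hint == self.host:
            # Local object: dispatch on the type field of the UID.
            return self.type_managers[target.type_id](call)
        # Remote object: hand the call to the Network Handler.
        return self.network_handler(target.host_hint, call)
```

Because the host field is only a hint, a real switch would
also have to handle the case where the object is no longer at
the hinted host; the sketch omits that path.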


3.1.2  Unique  Identifier  Generation 

The "type" and "instance" fields of an
extended UID are unique numbers. Each of these
unique numbers consists of three fields: the host
identifier of the host at which they were
generated, the incarnation number and the sequence
number within an incarnation number. The kernel
contains a component, called the SmallStepper, that
generates some range of unique numbers. This
component acquires the capability to generate
multiple ranges of unique numbers from a
distributed object called the LargeStepper. A unique
number is obtained from the SmallStepper,
which resides only in the volatile storage. The
SmallStepper issues sequence numbers for a given
incarnation number.


In a system where no failures can occur, each
host will generate a monotonically incremented
sequence of unique identifiers. If we permit
failures, but stipulate that every node in the
system has stable storage, then each host will
store the next incarnation number, and as soon as
it restarts on crash recovery, it will retrieve
this number and write to stable storage the next
incarnation number. Thus even though some part of
a range of sequence numbers may not be generated,
the hosts will generate a monotonically increasing
sequence of unique numbers.
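
The restart rule described above can be illustrated with a
short sketch. It is written in Python under stated
assumptions and is not the Zeus algorithm: the class
StableStorage stands in for storage that survives a crash,
and the tuple layout (host, incarnation, sequence) and the
method name next_unique_number are hypothetical.

```python
class StableStorage:
    """Stands in for storage whose contents survive a host crash."""

    def __init__(self):
        self.next_incarnation = 0


class SmallStepper:
    """Issues sequence numbers within the host's current incarnation."""

    def __init__(self, host_id, stable):
        self.host_id = host_id
        # On every (re)start: retrieve the stored incarnation number and
        # write the next one back before issuing any unique numbers.
        self.incarnation = stable.next_incarnation
        stable.next_incarnation = self.incarnation + 1
        self.sequence = 0              # volatile; lost on a crash

    def next_unique_number(self):
        self.sequence += 1
        return (self.host_id, self.incarnation, self.sequence)
```

Creating a new SmallStepper on the same StableStorage models
a crash and restart: unused sequence numbers of the old
incarnation are skipped, but every number issued afterwards
compares greater than every number issued before.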


aaetion,  tna  Typa  Muiagara  for  tnaaa  ayatan  t/paa 
ara  cafinao.  Tha  rollowing  arc  :na  Syataa  T/p# 
Hanagara  union  aziat  at  aaen  neoa  in  tna  ayataa. 

(1)  Typa^Typa  Nanagar 

(2)  Proeana/Tranaaction  ianagtr 

(3)  Principal  and  Autnantication  nanagar 

(U)  Syneoile  Maaa  Managar 

( 5 )  Prograa  nanagar 

(6)  Haaaaga  nanagar 

Tt»a  funettona  providad  6y  tnaaa  Typa  <ianagtrs  »:or.g 
ultn  tnair  atructuraa  ara  caac.-lpao  oaisu.  -jc.*. 
tnaaa  Typa  Hanagara  ia  conaicartc  u  an  :f 

ClatributaO  typa;  an  inatanca  sf  eac.n  t.'  -.re'se  */>• 
Hanagara  raaioaa  at  every  noca.  TT.e  ciai.- 

Typa  Hanagara  for  a  given  :vpe 

cooparatlvaly  to  provide  tna  acac.*ac::on  i 
aingla  ayataa-wlda  Typa  Nanager. 

3.2.1  Typa-Typa  Manager 


If  ua  raaova  tna  aaauaption  of  ataola  atoraga 
on  all  noata  in  tna  ayataa.  tnan  noata  In  tna 
syataa  can  oa  dlvidad  into  two  claaaaa:  cnoaa  tnat 
poaaaaa  ataola  atoraga  and  tnoaa  tnae  do  net.  Sacn 
noat  ultn  ataola  atoraga  la  addition  to  tna 
SaallStappar  naa  a  procaaa  callad  tna  LargaStappar 
union  togatnar  uttn  tna  otnar  LargaStappara  in  tna 
syataa  ganarataa  nau  incarnation  nuaoara.  Ttia 
algorltna  uaad  to  do  tnia  ia  apaelflad  in  a 
saparaea  paper  C0inT83]. 

3.1.3  Matuerk  Handler 


The definitions of new Type Managers are
introduced in the system by using the mechanisms
supported by a system-wide object called the
Type-Type Manager. Thus, the Type-Type Manager
implements functions to create, alter, delete and
replicate Type Managers. The definition of the
Type-Type object given here is an adaptation and
extension of the Type-Type concepts originating in
the HTSM operating system. The
facilities provided by the Type-Type Manager
include an explicit command on where to locate
copies of a Type Manager.


This component provides a simple datagram
level of transport mechanism between different
kernels. It interfaces with the Operation Switch.
The invocation requests for remote nodes are handed
over by the Operation Switch to the Network
Handler, which has the responsibility for
delivering them to the Operation Switch at the
destination host. Similarly the response messages
are returned from the server to the invoker by the
Network Handler via the Operation Switch.

3.1.4 Kernel Initiator

The kernel initiator has two functions. The
first function is to restart a host when it
recovers from a failure. The second is to initiate
a task. Both tasks require a certain amount of
housekeeping. Host recovery implies the setting up
of tables for the dispatcher of the kernel, using
the log for the Type-Type Manager to create,
delete, or modify the Type Managers on the host,
and obtaining a new incarnation number and the
SmallStepper sequence number. After the above
actions are successfully completed, the initiator
can hand control to the task dispatcher.


3.2 System-Defined Type Managers


As mentioned previously, Zeus is a set of Type
Managers whose members may potentially change
dynamically as Type Managers are created, deleted,
and modified. There is, however, a subset of Type
Managers called the System Type Managers which
perform the essential services provided by the
kernel of a conventional operating system. In this


Type Managers are active objects. At any
point in time, one or more copies of the Type
Manager for a given type may be active. By active
we mean that either within the Type Manager calls
against its object instances are in progress, or
that some of the functions it implements have
invoked calls on some other Type Manager and are
waiting for a return. This complicates the
Type-Type Manager because it must ensure that all
copies of a Type Manager are in a quiescent state
and will stay in that state before an operation can
be invoked against that Type Manager. However, we
believe that operations to modify existing Type
Managers will be quite infrequent; therefore,
schemes based on global synchronization can be used
for consistency management.

3.2.2 Process/Transaction Manager

Processes and transactions are active objects
in the system through which a user carries out
operations in the system. Transactions are atomic
processes, i.e. they have an "all or nothing"
property. The transaction facility with its atomic
property provides a powerful mechanism for reliable
operations. A transaction either commits or aborts
on termination, and if it aborts then no trace of
its execution is left. On the commitment of a
transaction, all updates made by it are permanent.

We require that a process must invoke a
transaction in order to modify permanent shared
objects in the system. The changes to an object
are recorded as new versions of the object. New
versions of the object are committed to becoming
permanent at the end of a successful completion of




the commit protocol among the invoked transaction,
the invoking process, and the Type Managers of the
modified objects. Uncommitted versions are
discarded on explicit abort commands issued by the
transaction process or on timeout due to
inactivity.
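
A minimal sketch of the version discipline just described may help.
It is an illustration only, not the Zeus implementation: the class
and method names are hypothetical, and a single in-memory object
stands in for a permanent shared object managed by its Type
Manager.

```python
class VersionedObject:
    """Sketch: updates become new versions, permanent only at commit."""

    def __init__(self, value):
        self.committed = [value]   # permanent versions, oldest first
        self.uncommitted = {}      # transaction id -> tentative value

    def write(self, txn_id, value):
        self.uncommitted[txn_id] = value

    def read(self, txn_id=None):
        # A transaction sees its own tentative version; everyone else
        # sees the latest committed version.
        return self.uncommitted.get(txn_id, self.committed[-1])

    def commit(self, txn_id):
        if txn_id in self.uncommitted:
            self.committed.append(self.uncommitted.pop(txn_id))

    def abort(self, txn_id):
        # Explicit abort or inactivity timeout: discard without a trace.
        self.uncommitted.pop(txn_id, None)


account = VersionedObject(100)
account.write("t1", 150)
seen_inside = account.read("t1")
seen_outside = account.read()
account.abort("t1")
after_abort = account.read()
account.write("t2", 120)
account.commit("t2")
after_commit = account.read()
```

The aborted transaction "t1" leaves the committed history
untouched, which is the "all or nothing" property in miniature.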

Processes and transactions can establish
recovery points by checkpointing. Such points are
used for the purpose of rollback and restart of a
process or transaction. Checkpointing is the
selective saving of versions of process or
transaction objects. Note that with the above
scheme for checkpointing, only the state of the
process (or transaction) object is saved; the
states of objects modified by that process are not
saved in the checkpoint. This approach may create
problems for error recovery since not all state
changes of the process are recorded with the
checkpoint. However, one must remember that all
changes made within a transaction to permanent
objects via their Type Managers are saved on the
stable storage as uncommitted versions. It is,
therefore, necessary to exercise some discipline in
using checkpoints and atomic transactions. The
following discusses how checkpointing can be used
correctly to support recovery.

The first problem that we want to address is
how one guarantees correct rollback recovery. One
requirement for correctly implementing rollback is
that the object manager for transactions must
maintain for each transaction a list of UIDs of the
objects that are affected by the transaction. The
reason for this is that a rollback requires
that all changes to objects made by the
transaction after the checkpoint be discarded. The
list of UIDs of the objects that are affected by
the transaction provides a means for notifying
these objects to discard the aborted versions.
This list is also used at the end of the
transaction to conduct the commit protocol. The
discussion in the preceding paragraph implies that
for implementing rollback, timestamps must be
recorded with each checkpoint and each version of
objects. This requirement stems from the fact that
there is not a one-to-one mapping between process
or transaction checkpoints and versions of objects
affected by them.
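
The role of the UID list and the timestamps can be sketched as
follows. This is a simplified illustration under stated
assumptions: a logical counter supplies timestamps, a module-level
dictionary stands in for the objects held by their Type Managers,
and only one transaction writes, so versions are not tagged with
their owner.

```python
import itertools

clock = itertools.count(1)   # logical timestamps (assumption)
objects = {}                 # UID -> list of (timestamp, value) versions


class Transaction:
    def __init__(self):
        self.affected = set()    # UIDs of objects this transaction touched
        self.checkpoints = []    # timestamps of its checkpoints

    def write(self, uid, value):
        objects.setdefault(uid, []).append((next(clock), value))
        self.affected.add(uid)

    def checkpoint(self):
        self.checkpoints.append(next(clock))

    def rollback(self):
        # Notify every affected object to discard the versions that
        # were created after the most recent checkpoint.
        cutoff = self.checkpoints[-1]
        for uid in self.affected:
            objects[uid] = [(ts, v) for ts, v in objects[uid] if ts <= cutoff]


t = Transaction()
t.write("obj-a", 1)    # timestamp 1
t.checkpoint()         # timestamp 2
t.write("obj-a", 2)    # timestamp 3
t.write("obj-b", 9)    # timestamp 4
t.rollback()
```

Without the timestamps there would be no way to tell which of the
two versions of "obj-a" predates the checkpoint.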


The second problem is the interaction between
process checkpointing and commitment of a
transaction invoked by the process after that
checkpoint. Suppose a process crashes after
committing a transaction. In such a case the
process restarts from its last checkpoint, but the
transactions that have been committed since the
establishment of this checkpoint are not undone.
Thus some committed transactions might be executed
more than once due to the restart. If a
transaction is nonidempotent, i.e. multiple
executions of the transaction produce different
results, a problem may arise in error recovery
since rollback of the process may cause a committed
transaction to be executed again. One solution to
this problem is to always force the invoking
process to checkpoint concurrently with the
committing transaction. Checkpointing is part of
the commit protocol: if the protocol determines to
abort, the checkpoint is discarded. With this


mandatory checkpoint, rollback recovery of a
process can avoid undesirable repetition of
transaction execution. However, this may cause
frequent checkpointing of the invoking process.
Therefore the second solution is to make the
checkpoint of the invoking process an option that
is to be specified at the time of invoking a
transaction. Apparently this checkpoint is not
required for idempotent transactions to guarantee
correct execution. However, a process invoking an
idempotent transaction may elect to force a
checkpoint during the transaction commit protocol
for efficiency reasons. For example, if a
transaction requires extensive computation compared
to checkpointing the invoking process, and if the
possibility of a failure is significant, it may be
desirable to have a checkpoint as described above.
The decision of when to checkpoint is left to the
process that invokes the transaction.
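
The effect of the optional checkpoint can be seen in a small
simulation. It is a sketch only: the function name and its
parameters are hypothetical, a process is modeled as a fixed series
of transactions invoked in order, and a single crash is injected
after a given number of commits.

```python
def run_with_crash(n_transactions, checkpoint_on_commit, crash_after):
    """Return how many times each transaction executes in total.

    The process commits `crash_after` transactions, crashes, restarts
    from its last checkpoint, and then runs to completion.
    """
    executions = [0] * n_transactions
    last_checkpoint = 0                 # first transaction to redo on restart
    for i in range(crash_after):
        executions[i] += 1              # transaction i commits
        if checkpoint_on_commit:
            last_checkpoint = i + 1     # checkpoint taken within the commit
    # Crash here; committed transactions are not undone.
    for i in range(last_checkpoint, n_transactions):
        executions[i] += 1
    return executions


without_option = run_with_crash(3, checkpoint_on_commit=False, crash_after=2)
with_option = run_with_crash(3, checkpoint_on_commit=True, crash_after=2)
```

With the option off, the two transactions committed before the
crash run twice, which is harmless only if they are idempotent.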

The Process/Transaction Manager also supports
nesting of transactions; such nested transactions
can execute concurrently with the parent
transactions. The nested transaction facility
provides the users mechanisms to introduce
concurrency within a transaction. The commitment
of a nested transaction is dependent on the
commitment of the parent transaction.

3.2.3 Principal and Authentication Manager

The object protection system in Zeus depends
on the ability of the individual Type Managers to
identify any process which requests that an
operation be performed. In addition, the Type
Managers need to be able to determine the ultimate
initiator of the action which resulted in such an
invocation request. We call these initiators of
actions principals. Principals are permanent
objects in Zeus and they are the only objects which
carry the authority to perform computations
involving other objects. When a new process is
created, it is "owned" by a single principal and it
retains this principal association throughout its
lifetime.

The two fundamental problems of the protection
system, authentication and authorization, both
involve principal objects and the association of
processes to principals. The problem of
authorization, that is, determining on whose behalf
a given process is currently working, is a fairly
simple matter since each process is always working
for a single principal only. When a process
invokes an operation on a Type Manager, the
information regarding its UID and principal
association is transported onto the virtual machine
of the target Type Manager. In this way, the
principal which owns a particular process is always
known by any Type Manager on which it makes
invocation requests. In addition, since process
identifiers are transported to and from Type
Manager machines by system code, a process is
unable to forge its own principal association to
gain access to objects its real principal is not
authorized to access.

During login, a user is first asked to identify
himself by giving his unique principal symbolic
name. The login process (also called the
Authentication Manager) tries to find a principal
object containing the same symbolic name. The
principal object contains the pertinent
information about that user. The user's password
is stored with the principal object, allowing the
Authentication Manager to perform necessary
authentication checks. Two other pieces of
information regarding the user are maintained
within the principal data object. One is the
unique identifier (UID) of the user's symbolic name
context, which is described in the next section.
The other is the UID of the command interpreter or
shell program of the logged-in principal.

Since the Authentication Manager must find a
principal object given only its symbolic name, it
follows that this name must be unique. In order to
make it convenient for unique names to be assigned
to principals, Zeus has the concept of a working
group (WG). Working groups are used to form a
strict hierarchy of principal names. This
hierarchy of names is similar to that used in the
Multics system. Working groups contain members
which may be either principals or other working
groups. The root working group has a null name and
is called the null working group. The unique name
of a principal or working group is formed by
concatenating the name of the principal or WG with
the names of all of its containing working groups.
This hierarchical structure also forms the basis
for other symbolic names in the system.
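
As a small illustration of the naming rule, the sketch below forms
a unique name by concatenation. The function name, the "."
separator, and the innermost-first ordering are assumptions; the
text specifies only that the names are concatenated.

```python
def unique_name(name, containing_groups, sep="."):
    """Join a principal or WG name with its containing working groups.

    `containing_groups` lists the enclosing working groups from the
    innermost to the outermost, leaving out the null (root) working
    group, whose name is empty.
    """
    return sep.join([name, *containing_groups])


first_alice = unique_name("alice", ["kernel", "research"])
second_alice = unique_name("alice", ["testing", "research"])
```

Two principals may share the short name "alice" as long as they
belong to different working groups, since their full names differ.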

3.2.4 Symbolic Name Manager

To provide user convenience, an object can be
given a symbolic name that is used when referencing
that object. A user in the system should be able
to use symbolic names within its context
independent of other users. For example, the same
symbolic name can be used by different users to
refer to different objects. Similarly, different
symbolic names can be used by different users to
refer to the same object. The Symbolic Name
Manager maintains the mapping between a symbolic
name for an object and that object's UID. The
mapping function is many-to-one in that several
symbolic names may be mapped to one object UID.
The symbolic names within a context must be unique.
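
The mapping rules above can be sketched with one dictionary per
context. The class and method names are hypothetical and the UIDs
are arbitrary integers chosen for the example.

```python
class Context:
    """Sketch of a symbolic name context: symbolic name -> object UID."""

    def __init__(self):
        self._names = {}

    def bind(self, name, uid):
        # Names must be unique within one context.
        if name in self._names:
            raise KeyError(f"name already bound in this context: {name}")
        self._names[name] = uid

    def resolve(self, name):
        return self._names[name]


ctx_a, ctx_b = Context(), Context()
ctx_a.bind("report", 1001)
ctx_a.bind("draft", 1001)     # many-to-one: two names, one UID
ctx_b.bind("report", 2002)    # same name, different object, other user
```

Keeping a separate context per principal is what lets two users
bind "report" to different objects without conflict.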

The objects which are managed by the Symbolic
Name Manager are symbolic name contexts, where a
context object contains the above mentioned
mapping. A context may be viewed as a private
directory of relative symbolic names. Each
principal is given a context when the principal is
created. It is initialized with the symbolic name
to object UID mappings of certain system objects
which a principal must know in order to function
properly. The Symbolic Name Manager maintains a
data base that contains the context objects and the
current state of the context operations. In the
event of a failure the data base provides for the
recovery of the state of the Symbolic Name Manager
and the recovery of the context objects.


Contexts, like other objects, may be shared
among principals having the proper access and are
part of the mechanism by which principals may share
objects. If principal A wishes to share object X
with principal B, A must give B access rights to X
and must also give B the UID of X. When A shares
its context with B, B is able to obtain the UID of
X through an agreed upon symbolic name for X. It
is important to note that sharing a context in no
way enhances or alters the access rights to any of
the objects whose UIDs are in a shared context.
Access to an object is still coordinated by its
associated Type Manager.

3.2.5 The Program Type Manager

The Program Type Manager is the repository for
both program text and object code. Program text is
defined to be a text object that compiles
correctly. Thus, the creation of a program object
requires the user to supply the Program Type Manager
with a correct program or a separately compilable
unit of a program. The Program Type Manager, in
addition to its function as a repository, acts as a
builder of programs. Thus, a user can call upon
the Program Type Manager to build a new program
from some specified components. This building
function of the Program Type Manager is used by
the system to build new user types. A program
object is defined to be a collection of versions of
a single program. The criteria for retaining
program versions in the system are defined by the
users.


3.2.6 Message Type Manager

The Message Type Manager provides for the
synchronous and asynchronous exchange of messages
between processes. At the time a message is
created, the sender can specify the reliability
class for that message. The reliability class of a
message reflects its availability to the receiver
in the face of one or more host failures in the
network. At the low end of reliability there are
volatile message objects that disappear upon host
failure (if the object resides on the failed host).
At the high end of reliability stable message
objects have a replication factor of n where n is
the number of hosts in the network. Two
additional intermediate reliability classes exist.

Interprocess communication may occur between
processes that are local to a host, or remote. In
either case message operations are performed by the
Message Type Manager local to the host of the
invoking process. Any remote communication
required by the operation is done by peer Message
Type Managers and is unseen by the processes
involved.


4.0 CONCLUSIONS

In this paper we have presented an overview of
the Zeus distributed operating system which is
suitable for highly reliable applications. Zeus is
an object-oriented system which is novel in the
sense that it integrates many of the conventional
database management functions into the operating
system. Recovery and synchronization are
transparent to the application programmers. The
software for this system looks no different than
the conventional software because of the remote
procedure call mechanism which makes accessing of
remote and local objects identical. This

distributed operating system is basically a
collection of system and application defined object
managers (also referred to as Type Managers because
they manage objects of a specific type). In
this paper we have described some system defined
Type Managers that provide the essential facilities
to the application developers for creating their
own Type Managers. This paper has presented
certain principles that have not been tested yet in
a real implementation; there still remains a
significant amount of research to be done to
demonstrate these ideas in a real system.

Performance models of a distributed, real-time command and control system are presented, including models
analyzing fault-propagation and associated fault recovery strategies. The techniques and tools described
here are applicable to the design and analysis of distributed systems in general and are ready and available
for use today by modeling practitioners.


1. INTRODUCTION

Truly distributed systems are now finally beginning to
appear in substantial number in information management
applications. Distributed systems introduce a new set
of implementation techniques and mechanisms and thus a
new set of performance limiting factors. These factors
include 1) the additional interfaces and mechanisms that
integrate the local processing at a given site into
global resource environments as well as 2) the actual
costs of communication and data management.

This paper describes an integrated practical framework
for introducing these interfaces and mechanisms into the
performance modeling of truly distributed processing
systems. The paper begins with a conceptual description
of techniques for distributed computing and the added
processing that results. This additional processing can
be viewed in a hierarchical fashion: The lowest level
is a logical representation of a network. The
representation which we use here, for the sake of
concreteness, is clusters of hosts connected by long
distance links. Each cluster is structured internally
using an Ethernet. The next higher level is the
interface of Remote Procedure Calls (RPCs) and messages
to the logical network defined by packet transmission.
RPCs couple normal local processing in higher level
languages with remote resources. Message based
processing may also be made visible at the application
level, but RPCs are the mechanisms that most naturally
introduce remote processing into normal procedural
programs for information management applications.

These distributed processing system elements are
described in a format appropriate for developing
performance models. This description is then developed in
terms of information processing graphs (IPGs), a
diagrammatic framework for combining workload, system
and hardware configurations. The IPGs can be
thought of as generic models from which specific
models can be constructed in terms of some
simulation (or analytic) modeling language. The modeling
language used in this project was the Performance
Analyst's Workbench System (PAWS) [IRA84]. A limited
number of copies of the PAWS models described here are
available upon request. The models described here were
developed in a top-down, hierarchical manner in
conjunction with the top-down design of the distributed
system. The IPGs shown here reflect this hierarchical
development: the initial IPGs document the flow of
information at the highest levels in the system, and as
our explanation proceeds, each component in the
high-level IPGs is expanded until an appropriate level of
detail is reached. It is anticipated that these
modeling techniques and tools can be used to extend
performance modeling into the realm of distributed
systems in a practical way.

This representation of the performance of distributed
systems has been developed in the context of a project
for determining the costs of introducing very high
reliability into distributed systems. This project is
named ZEUS and is based on the concept of
object-oriented programming. Each entity in the system
belongs to some object type, and an object instance
encapsulates the data structure of the object and the
operations that can be invoked against it. Objects are
identified by a system-wide unique identifier;
operations can be invoked against objects locally or from a
remote host. In this context the need for a remote
procedure call is obvious. The interest in this project
is focused upon the relative costs of system execution
with and without reliability mechanisms. The system we
are modeling has not, in fact, been implemented. The
techniques we define for modeling distributed systems
do, however, appear to offer a broadly applicable
framework upon which to extend performance models of a
traditional structure directly to distributed resource
environments.

Section 2 briefly describes the IPG notation and the
PAWS language. Section 3 describes the physical
execution environment involved in this project. Section 4
illustrates how fault-free distributed systems can be
modeled, including models for the RPC, hosts, and
communication. Section 5 introduces faults into physical
resources and demonstrates how these can be propagated
upwards to the users of these resources, thus permitting
recovery strategies to be modeled. The RPC can be
viewed as an example of distributed system usage. Other
applications could use the physical resource models in a
similar fashion. Thus, we present here a set of
techniques that can be used to model a wide variety of
distributed system usage.

2. INFORMATION PROCESSING GRAPHS AND PAWS

The performance models described here were developed
using Information Processing Graphs (IPGs) and the
Performance Analyst's Workbench System (PAWS). This
section briefly describes IPGs and PAWS; for a complete
description see [IRA84].

The use of IPGs and PAWS is diagrammed in Figure 2.1.

Figure 2.1. The Use of IPGs and PAWS


The modeler begins with an existing or to-be-designed
real system and abstracts from that system a high-level
picture of the flow of information in the system. This
picture is called an IPG. The details of each part of
the picture are then declared in the PAWS simulation
language to obtain a PAWS model, which is evaluated by
the PAWS simulator to obtain performance statistic
estimates such as response times, throughputs,
utilizations, etc.

2.1 INFORMATION PROCESSING GRAPHS

An IPG diagrams the flow of information from resource to
resource in an information processing system (e.g., see
Figure 2.4). Information processing systems, such as
computing, communication and office systems, can be
thought of in terms of work stations or nodes at which
information is processed. In a computing system the
nodes may represent central processors, disk units,
device controllers, etc. Edges have labels denoting the
form (transaction category, transaction phase, branching
probability) of information flow along the edge.
Information flows in discrete units called transactions.
A transaction in general represents the data on which
the nodes of the information processing graph operate
and in particular may represent a job, program, task,
scheduler, message process, person, or any such entity
useful to the modeler. A transaction gets processed in
some manner at each node and, upon completion of
processing, moves along the direction of an edge to some
other node for additional processing. Several
transactions may be simultaneously active (being operated upon)
at the various nodes of the IPG.

We associate a category and a phase with each
transaction. The category of a transaction is a name
(denoted by any string of alphanumeric symbols) and is
permanent: a transaction has one unique category
throughout its lifetime. The phase of a transaction
(denoted by an integer) may be changed as the
transaction progresses through the IPG. The processing of a
transaction at a node and the routing behavior of a
transaction from node to node in general depend on that
transaction's category and phase. The general form for
denoting a transaction's category and phase is (category
name, phase number).

There are five classes of nodes used in the IPG
notation: 1) resource management nodes, 2) routing
nodes, 3) arithmetic nodes, 4) INTERRUPT nodes, and 5)
USER nodes. Each class of nodes is discussed briefly
below.


(1) Resource management nodes represent system
resources (processors, memory, communication links,
disks, copying machines, people, etc.). A transaction
normally requests the use of certain resources and may
have to queue (wait) for a resource if the request
cannot be fulfilled immediately. Thus, resource nodes
have queues associated with them. Resources may be
classified as active resources or passive resources.

Conceptually, an active resource is something that acts
or works, such as a processor or disk unit. An active
resource is represented by a SERVICE node in PAWS. A
transaction arriving at a SERVICE node requests the use
of the resource for a specified amount of time (usually
drawn from a specified service time distribution). If
the resource is being used, the transaction must queue
(wait) until it is scheduled for service according to
the queueing discipline specified for that SERVICE node.
After receiving service the transaction exits the node
along some edge to another node. Figure 2.2 shows a
portion of an IPG representing a SERVICE node named CPU.
The circle represents the processor and the open box or
square represents the queue for waiting transactions.
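
The behavior of a single SERVICE node can be sketched in a few
lines. This is not PAWS code: the function name is hypothetical,
the queueing discipline is assumed to be first-come first-served,
and fixed service times replace draws from a service time
distribution.

```python
def fifo_service(arrivals, service_times):
    """Completion time of each transaction at one FIFO SERVICE node."""
    free_at = 0.0          # time at which the resource next becomes idle
    completions = []
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, free_at)   # queue if the resource is busy
        free_at = start + service
        completions.append(free_at)
    return completions


arrival_times = [0.0, 1.0, 2.0]
completions = fifo_service(arrival_times, [3.0, 1.0, 1.0])
response_times = [done - arrive
                  for done, arrive in zip(completions, arrival_times)]
```

The second and third transactions need only one time unit of
service each, yet their response times are three units because
they wait in the queue behind the first.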

Figure 2.2. SERVICE Node CPU

A passive resource doesn't itself do any work but is
something that must be possessed by a transaction to do
work. Memories, buffers, and control points are
examples of passive resources. Passive resources
usually occur in groups; for example, memory may be
regarded as a group of pages. The amount of time a
passive resource is held by a transaction is not
specified by a service time distribution. After acquiring a
passive resource (such as memory), a transaction
typically uses one or more active resources (processors,
disks, etc.) before releasing the passive resource.
Thus, a passive resource is represented in an IPG by
two nodes: one at which the resource is acquired and
one at which the resource is released.

There are two types of passive resources: TOKENS and
MEMORIES. TOKENS are acquired at ALLOCATE nodes and
released at RELEASE nodes. MEMORIES are acquired at
GETMEM nodes and released at RELMEM nodes. TOKENS may
be used to model input and output buffers, channels,
pages, domains or control points, communication links,
and other passive resources. Typically, a separate
token is used to represent each resource (buffer, page,
etc.), and these tokens are partitioned into types with
one token type for each type of resource (input buffers,
main memory pages, etc.). MEMORIES are used to model
contiguously addressed passive resources such as main
memories, extended core storage, and disk space.
Associated with each memory is a memory management
scheme according to which blocks of memory are allocated
to transactions.

TOKENS do not have to be RELEASEd to the node at which
they were ALLOCATEd. At a RELEASE node a transaction
may specify any ALLOCATE node to which the tokens are to
be released. TOKENS may be created at CREATE nodes and
destroyed at DESTROY nodes. Figure 2.3 shows a portion
of an IPG in which transactions a) acquire BUFFER tokens
at the ALLOCATE node named GR, b) create BUFFER tokens
for GR at the CREATE node named MAKE, c) destroy BUFFER
tokens for GR at the DESTROY node named KILL, and d) at
the RELEASE node named KT, release BUFFER tokens for
the ALLOCATE node named CRRORE.




Associated with each node at which resources (active or
passive) are acquired is a queueing discipline, i.e.,
the discipline according to which transactions enqueue
if the resource is not available immediately.
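
A token pool with a queue can be sketched as follows. This is an
illustration, not PAWS syntax: the class and method names are
hypothetical, and the queueing discipline is assumed to be
first-come first-served.

```python
from collections import deque


class AllocateNode:
    """Sketch of an ALLOCATE node: a token pool plus a FIFO queue."""

    def __init__(self, tokens):
        self.tokens = tokens       # free tokens currently in the pool
        self.waiting = deque()     # transactions queued for a token

    def request(self, txn):
        """Return True if txn gets a token now; otherwise queue it."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        self.waiting.append(txn)
        return False

    def add_token(self):
        """A token arrives (released or created).

        It goes to the longest-waiting transaction, whose name is
        returned, or back into the pool (returning None).
        """
        if self.waiting:
            return self.waiting.popleft()
        self.tokens += 1
        return None


buffers = AllocateNode(tokens=1)
got_t1 = buffers.request("t1")
got_t2 = buffers.request("t2")    # pool is empty, so t2 waits
woken = buffers.add_token()       # a RELEASE elsewhere frees a token
idle = buffers.add_token()        # nobody waiting; token returns to pool
```

Because `add_token` does not care where the token came from, the
same call serves both a RELEASE node and a CREATE node that target
this pool.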


Figure 2.3. ALLOCATE, CREATE, DESTROY, RELEASE

The rate at which an active resource processes
information or the ability of a transaction to acquire a
passive resource may depend on events occurring in other
parts of the system. In such cases, a transaction at
a SET node may request a modification to the service
rate or power of a SERVICE, ALLOCATE, or GETMEM node.
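
The token operations described above (acquire, release, create, destroy) can be illustrated with a short Python sketch. This is not PAWS code: the class name TokenPool and its methods are hypothetical, and a first-come first-served queueing discipline is assumed for waiting transactions.

```python
from collections import deque


class TokenPool:
    """Hypothetical stand-in for an ALLOCATE node holding one token type."""

    def __init__(self, name, tokens):
        self.name = name
        self.free = tokens        # tokens currently available
        self.waiting = deque()    # queued transactions (FCFS is an assumption)
        self.granted = []         # transactions that obtained a token, in order

    def allocate(self, transaction):
        """ALLOCATE: grant a token now, or enqueue the transaction."""
        if self.free > 0:
            self.free -= 1
            self.granted.append(transaction)
        else:
            self.waiting.append(transaction)

    def release(self, count=1):
        """RELEASE: return tokens; each one goes to a waiter if any exists."""
        for _ in range(count):
            if self.waiting:
                self.granted.append(self.waiting.popleft())
            else:
                self.free += 1

    def create(self, count=1):
        """CREATE: add new tokens, treated like releasing tokens never held."""
        self.release(count)

    def destroy(self, count=1):
        """DESTROY: remove up to `count` free tokens; return how many went."""
        removed = min(count, self.free)
        self.free -= removed
        return removed
```

Because release is a method of the target pool, a transaction can return tokens to any pool, which mirrors the rule that tokens need not be released to the node at which they were allocated.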

(2) Routing nodes may be used to create and destroy
transactions and to alter transaction flow through the
system. There are six types of routing nodes: SOURCE,
SINK, FORK, JOIN, SPLIT, and BRANCH nodes. At a SOURCE
node, transactions are created (arrive) periodically
according to a user-specified interarrival time
distribution. At a SINK node transactions disappear from
the system forever. A transaction may spawn a number of
children transactions at a FORK node, and the children
may coalesce at a JOIN node to recreate the parent.
FORK and JOIN nodes are useful for modeling the
synchronization of concurrent processes. A transaction
may create a number of SIBLING transactions at a SPLIT
node, much as at a FORK node. The transactions created
at a SPLIT node may, for instance, be used to model the
operation of message communication. BRANCH nodes may be
used to facilitate the specification of branching
(routing) probabilities and to collect statistics.

Figure 2.4 illustrates the use of the routing nodes.
Each transaction enters the system at the source node
START and proceeds to the fork node TFORK, where the
transaction creates two children and waits for the
children to join.


Figure 2.4. Routing Example


Each child requests and uses a processor (ACPU or BCPU)
before proceeding to the join node TJOIN. As soon as
both children of a transaction reach TJOIN the children
disappear and the parent (still waiting at TFORK)
replaces its children at TJOIN and proceeds from TJOIN
to the split node, TSPLIT, where a sibling transaction is
created. The newly-created sibling proceeds to send a
message before leaving the system at the sink node
TSINK. The original sibling (the transaction that
started at START) travels from TSPLIT to the branch node
TBRANCH, from which it goes back to TFORK with
probability 0.9 or to TSINK with probability 0.1.
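
As a small worked example of the branching probabilities above: if a transaction returns to TFORK with probability 0.9 after every pass, the number of passes is geometrically distributed with mean 1/(1 - 0.9) = 10. The Python sketch below computes this and checks it by sampling. The function names are hypothetical and the sketch assumes each branching decision is independent.

```python
import random


def expected_passes(p_back):
    """Mean number of passes through TFORK when the transaction returns
    with probability p_back after each pass (geometric distribution)."""
    return 1.0 / (1.0 - p_back)


def simulate_passes(p_back, rng):
    """Count passes through TFORK for one transaction until it reaches TSINK."""
    passes = 1
    while rng.random() < p_back:
        passes += 1
    return passes
```

Averaging simulate_passes(0.9, rng) over many transactions should approach expected_passes(0.9).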

(3) Arithmetic nodes are used to carry out computational
steps and to modify simulation variables. There
are two types of arithmetic nodes: COMPUTE nodes and
CHANGE nodes. Each transaction has local variables
associated with it. In addition, the network has some
global variables associated with it. A COMPUTE node is
used for assignment of values to and conditional
operations on these variables. A CHANGE node is used to
change the phase of a transaction (probabilistically).
In Figure 2.5, a transaction arriving at the compute
node ACOMP increments a global variable named COUNT and
proceeds to the CHANGE node named SWITCH, which changes
the transaction's phase from 1 to 2 or 2 to 1.


Figure 2.5. Arithmetic Nodes (ACOMP, SWITCH)

(4) An INTERRUPT node is used by one transaction to
interrupt the processing of another transaction. The
interrupted transaction immediately departs from the
node where it was interrupted with a new phase assigned
to it by its interrupter.

(5) A transaction arriving at a USER node invokes a
user-written FORTRAN subroutine that has access to the
global variables and the transaction's local variables.
The USER node facility makes PAWS an extensible system.

2.2 THE PAWS LANGUAGE

The PAWS language is declarative rather than procedural:
the user simply declares the characteristics of the
system being modeled as opposed to coding detailed
simulation algorithms. An example CPU node definition in
the PAWS language is shown in Figure 2.6.

    CPU
    TYPE SERVICE
    QUANTITY 1
    QD RRFQ 10.0
    REQUEST <BATCH,ALL> HYPER (10.0, 14.1);

Figure 2.6. Sample Node Definition in PAWS

Here, the name of the node is CPU, the node type is
SERVICE, there is 1 server, the queueing discipline is
round-robin fixed quantum with a fixed quantum of 10
time units, and all BATCH transactions regardless of
phase request service times drawn from the
hyper-exponential distribution with mean 10.0 and
standard deviation 14.1.
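
To make the service-time specification concrete, the sketch below fits a two-phase hyper-exponential distribution to a given mean and standard deviation, using the common "balanced means" parameterization. How PAWS itself parameterizes the distribution is not stated in this paper, so the fitting rule and the function names are assumptions made for illustration only.

```python
import math


def fit_h2_balanced(mean, std):
    """Fit a two-phase hyperexponential with balanced means.

    Returns (p1, rate1, p2, rate2). Requires std > mean, i.e. a
    coefficient of variation greater than one.
    """
    cv2 = (std / mean) ** 2
    if cv2 <= 1.0:
        raise ValueError("hyperexponential needs std > mean")
    p1 = 0.5 * (1.0 + math.sqrt((cv2 - 1.0) / (cv2 + 1.0)))
    p2 = 1.0 - p1
    return p1, 2.0 * p1 / mean, p2, 2.0 * p2 / mean


def h2_moments(p1, rate1, p2, rate2):
    """Mean and standard deviation of the fitted two-phase mixture."""
    mean = p1 / rate1 + p2 / rate2
    second = 2.0 * (p1 / rate1 ** 2 + p2 / rate2 ** 2)
    return mean, math.sqrt(second - mean ** 2)
```

Calling h2_moments(*fit_h2_balanced(10.0, 14.1)) recovers the mean and standard deviation quoted for the CPU node, which confirms that the fitted mixture matches the two requested moments.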

3. PHYSICAL EXECUTION ENVIRONMENT

Any network software must execute on an underlying
system. The physical system involved in the ZEUS
project consists of clusters of hosts that are connected
by long distance links. The hosts in a cluster are
connected using a CSMA/CD local area network such as
Ethernet. We assume that a message is broken up into
packets at its source and re-assembled at its
destination. This implies a universal packet structure
that all the hosts in the network understand. The
network sub-model described in section 4.1 assumes this
underlying physical execution environment. Nevertheless,
the general principles presented in this paper are
entirely independent of that environment.

4. THE REMOTE PROCEDURE CALL

In this section we develop a model for a Remote
Procedure Call (RPC) from the application level to the
level of packets transmitted through the network. The
RPC model of distributed programming permits its users
to access remote programs using procedure or function
call semantics. Thus, an RPC can be viewed as a
call-response message pair. A number of schemes, including
the one presented here, choose to incorporate
acknowledgements for calls and responses. Figure 4.1 shows
the IPG for an instance of a call. There are three
nodes: SETCALL, RPC, and SEEVAL. The node RPC is a
complex (i.e., high-level) PAWS node and will be
expanded later in this section. A complex, high-level
node is denoted in IPGs by double vertical end bars.
The nodes SETCALL and SEEVAL are PAWS COMPUTE nodes that
set the call parameters and inspect the results of the
call. The call parameters are call source, call
destination, service required and call size. These
parameters will determine the services required by the
call en route and at its source and destination.


Figure 4.1. IPG for an RPC

Figure 4.2 shows the sequence of messages exchanged
during an RPC. The caller can proceed after a response
has been received. Likewise the callee can continue
after the response is sent.

    (1) call            caller -> callee
    (2) call ack        callee -> caller
    (3) response        callee -> caller
    (4) response ack    caller -> callee

Figure 4.2. The Message Flow for RPC


In Figure 4.3 we develop the IPG for the message
sequence of Figure 4.2. This IPG is an expansion of the
complex RPC node of Figure 4.1. All of the transactions
that flow in this IPG are of the category CALL, thus
permitting a simple structuring of the system. The CALL
category uses the complex node NET to deliver messages
from host to host. The complex node THD is used to
process calls at a host. The node SETPARM sets the
parameters of a message (i.e., allows the source,
destination and type of message to be set). Although not
shown in Figure 4.3, the different phases of the category
CALL are used to distinguish between the different
messages sent to achieve the RPC.


Figure 4.3. IPG for the RPC


A transaction entering the RPC sub-model creates a child
transaction at the CFORK node. The child transaction
represents the remote procedure call; the parent
transaction remains in limbo at CFORK until the child arrives
at CJOIN. The child proceeds to SETPARM, where the
message parameters (source, destination, and type of
message) are set, and then to NET, which delivers
messages from host to host and is described in detail
later. On exit from NET, the child has arrived at its
destination and proceeds to CSPLT1, where it creates a
sibling transaction to model the call acknowledgement.
Immediately after this split (i.e., in parallel with its
newly created sibling) the child proceeds to THD, which
models the call processing. On exit from THD, the child
sets its message parameters at SETPARM and invokes the
NET sub-model to model the response, after which the
child creates another sibling transaction to model the
response acknowledgement. Immediately after this split,
the child performs a join operation at CJOIN, which
rescues the parent from limbo at CFORK and sends the
parent to CSINK to depart the RPC sub-model.


We next develop the complex node NET.


4.1 THE NETWORK SUB-MODEL


This section presents an expansion of the NET complex
node in Figure 4.3 and assumes a physical execution
environment as described in section 3. The overall IPG
is given in Figure 4.4. Briefly, after some computation
at the source host the message is broken up into packets
based on its size. The IPG uses the fork-join pair PF
and PJ to model the packetization and re-assembly of the
message. The computation done when the message is sent
and received is modeled by the complex node COMP, which
represents host computation. Each generated packet goes
through the network via the complex node NETSRV, which is
described next. Notice that within this IPG there are
three categories: before MF and after MJ there is the
category CALL, between MF and MJ there is a category
MSG, and between PF and PJ there is a category PKT.
This was done because we felt that the packetization and
re-assembly processes are a part of the message level.

The NETSRV complex node describes how packets are routed
through the network. Each hop through the network can
be through a long distance link (LDL) or a local area
network (LAN). We define statically for each pair of
clusters in the network a set of paths that will permit
transmission from the source cluster to the destination
cluster. Typically a packet will go from source to the
first gateway via a LAN and then via some combination of
LDLs and LANs to the destination host's cluster. Finally
the packet will go via the destination host's LAN to the
destination host. The NETSRV node is expanded in Figure
4.5. The node IPILG sets up the path of the packet,
whereas NXTLG determines the next leg of the path the
packet will take.
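
The packetization and re-assembly that the PF/PJ fork-join pair models in the network sub-model can be sketched in Python as follows. The packet layout (sequence number, total count, payload) and the function names are hypothetical; the paper only assumes that some universal packet structure exists.

```python
import math


def packetize(message, packet_size):
    """Split a message (bytes) into (sequence_number, total, payload) packets."""
    total = max(1, math.ceil(len(message) / packet_size))
    return [
        (seq, total, message[seq * packet_size:(seq + 1) * packet_size])
        for seq in range(total)
    ]


def reassemble(packets):
    """Join packets back into the message once all of them have arrived."""
    if not packets or len(packets) != packets[0][1]:
        raise ValueError("message incomplete")
    # Sorting the tuples orders the packets by sequence number.
    return b"".join(payload for _, _, payload in sorted(packets))
```

The join condition in reassemble (all packets present) plays the role of the PJ node: the message-level transaction resumes only after every packet-level child has arrived, in whatever order.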


Figure 4.5. The NETSRV Complex Node


Each PAWS transaction carries with it information
identifying the specific link or local area network the
transaction will visit, along with the source and
destination gateway hosts on the links or local area
networks. We next develop the models for the long
distance links and the local area networks.

4.2 A MODEL FOR A LONG DISTANCE LINK

The long distance link at its simplest can be modeled as
a first come first served queue. We have chosen not to
model packet acknowledgements at this level, though this
can be included in the model. The IPG is shown in
Figure 4.6. There LNK is a FCFS queue whereas the two
COMP complex nodes represent computation at the source
and destination of the link.


Figure 4.6. Model for the Long Distance Link


The node LNK could simply be a delay since we assume
that packets are all the same size. Finally, to model
packet retransmission we have an edge in the IPG that a
transaction takes should the packet's data be corrupted.
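
A minimal Python sketch of the long distance link model follows: a single first-come first-served server with a constant service time (all packets being the same size), plus the mean number of sends per packet when corrupted packets are retransmitted. The function names are hypothetical, and independent corruption of each send with a fixed probability is an assumption that the paper does not state.

```python
def fcfs_departures(arrivals, service_time):
    """Departure times from a single FCFS server with constant service time.

    `arrivals` must be sorted in non-decreasing order.
    """
    departures = []
    free_at = 0.0
    for t in arrivals:
        start = max(t, free_at)      # wait if the link is still busy
        free_at = start + service_time
        departures.append(free_at)
    return departures


def expected_transmissions(p_corrupt):
    """Mean sends per packet if each send is corrupted independently."""
    return 1.0 / (1.0 - p_corrupt)
```

For example, packets arriving at times 0, 1 and 5 on a link with a service time of 2 depart at times 2, 4 and 7: the second packet queues behind the first, while the third finds the link idle.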


4.3 A MODEL FOR THE LOCAL AREA NETWORK

The packet that enters the LAN sub-model knows its
source, its destination and the Ethernet on which it
must be broadcast. Further, at any given time a host can
attempt to broadcast only a single packet on the
Ethernet. Internal to the LAN the interactions are more
complex. These are therefore developed in the complex
node ETHER.


Figure 4.7. Model of the LAN

The COMP nodes again represent computation and the GTLIN
and RLLIN nodes represent usage of a token to ensure that
any host will attempt to broadcast only one packet at a
time. We next develop a model of the ETHER complex
node; Figure 4.8 shows the corresponding IPG. We note
that QL(<service_node>) is the current value of the
queue length at that service node. In this model there
are three service nodes: B-del, P-del, and T-del; and
two compute nodes: SETBACK and SETOUT.

The basic notion behind a CSMA/CD protocol is that a
broadcast has two phases: propagation and transmission.
During propagation, packet collisions can occur. During
transmission, the carrier sense mechanism causes the
other hosts to hold their packets for some back-off
period.


TP = probability of transmission


Figure 4.8. Details of CSMA/CD Protocol


If a packet (transaction) arrives with QL(P-del) and
QL(T-del) equal to zero then the net is not busy so the
branching probabilities NC and TP are set to one. This
is the normal case. The packet will wait until its
propagation delay P-del is over. Then if no collision
has occurred, the packet will proceed to T-del (because
TP is still set to 1) and will then depart the CSMA/CD
protocol. If a collision has occurred while
the packet is at P-del then TP will have been reset to
zero, so the packet will go to SETBACK to compute its
back-off delay and then to B-del to wait for that time,
after which it will retry transmission.

If a packet arrives when QL(P-del)>0 then one or more
packets before it are undergoing a propagation delay.
These previous packets must be blocked from going on to
transmission delay. Therefore, the newly arriving
packet sets TP to zero and proceeds to P-del so that
the residual effect of its broadcast will be felt by any
subsequent packets. After waiting at P-del the
colliding packet will leave (since TP is zero) to
calculate its back-off delay and then to the back-off
delay node B-del, after which it retries transmission.

If a packet arrives when QL(T-del)>0 (it can only be 1)
then the packet sets NC to zero and thus routes itself
to the backoff delay calculation node. From there it
goes to the backoff delay node and then to retry
transmission.
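
The three arrival cases above can be summarized as a small decision procedure. The Python sketch below is illustrative only: the function names are hypothetical, the dictionary keys NC and TP stand for the two branching probabilities, and testing the transmission queue before the propagation queue is an ordering chosen for the sketch (the text treats the cases as mutually exclusive).

```python
def on_arrival(ql_p_del, ql_t_del, flags):
    """Apply the arrival rules; return the next node for the packet.

    `flags` is a dict holding the branching probabilities NC and TP.
    """
    if ql_t_del > 0:
        # A transmission is in progress: carrier sensed, so back off.
        flags["NC"] = 0
        return "SETBACK"
    if ql_p_del > 0:
        # Collision with packets still propagating: spoil their transmission.
        flags["TP"] = 0
        return "P-del"
    # Idle network: the normal case.
    flags["NC"] = 1
    flags["TP"] = 1
    return "P-del"


def after_p_del(flags):
    """Route a packet leaving P-del according to TP."""
    return "T-del" if flags["TP"] == 1 else "SETBACK"
```

A packet routed to SETBACK would then compute its back-off delay, wait at B-del, and re-enter on_arrival to retry.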

4.4 THE MODEL OF A HOST

The handling of calls, messages, and packets needs
computation. The corresponding transactions in our
model invoke computation at the hosts they visit. The
computations have been represented as complex nodes in
the previous IPGs. This subsection presents an expanded
model of such computation.

We model a host as being a set of physical resources and
a set of processes each of which executes one or a fixed
set of computations. A PAWS transaction that wants to
invoke a computation on a host will instantiate a process
on that host. This process is modeled by a PAWS
transaction of category COMPT that will execute one of a
set of computations based on the parameters set by the
instantiator. The processes that execute on the same
host are modeled by SIBLING transactions, the advantage
of which will be pointed out in the next section.

We assume the physical resources of a host to be a
simple Central Server Model (CSM). Thus a computation
visits the CPU and an I/O device in succession one or
more times. At all times a single PAWS transaction of
category COMPT is allocated to a host. This transaction
remains blocked at the ALLOCATE node HSTBLK as shown in
Figure 4.9.


[Figure 4.9: IPG of the host model; surviving node labels include HSTBLK and WTCOMP.]


Figure 4.9 illustrates the instantiation and execution
of the computation processes by some user of a host.
Each host has a permanent transaction that is blocked
indefinitely at HSTBLK. A host user first enters the
mutual exclusion region bounded by the ALLOCATE node
MEGET and the RELEASE node MEREL in the upper part of
the figure. The user sets his identity and the type of
computation he requires executed in some global
variables. He then interrupts the transaction that
represents the host he wishes to execute on at node
HSTBLK. The interrupting transaction then blocks at
WTINT. The interrupted host transaction gets the
parameters stored in the global variables and uses them
to instantiate a sibling computation at GENCOMP. The
host transaction then, at node GNINT, interrupts the
host user transaction waiting at node WTINT. The host
transaction then goes back to node HSTBLK to await the
next interruption. The host user transaction exits the
mutual exclusion region by releasing the token at MEREL
and then goes to node WTCOMP to await termination of the
initiated computation.

The initiated computation knows the identity of the host
user transaction. On being created it performs the
desired computation at the complex node CSM and then at
node COMPINT interrupts its corresponding host user
transaction at node WTCOMP.

The IPG of Figure 4.9 illustrates how mutual exclusion
can be modeled with PAWS. More importantly, the host
transaction is at any time the SIBLING of all of the
computation transactions that are executing in the nodes
of the CSM complex node. Thus it can interrupt all of
them simultaneously without being aware of each of their
identities. This is a fact that we will take advantage
of in the following section.

4.5 A LOOK BACK

What have we accomplished so far? The chief benefit we
have achieved is the ability to develop IPGs and thus
simulation models of distributed systems in a clear
top-down manner. In addition we have developed models for
both hosts and local area networks that are intuitive
and easy to understand. We next tackle the task of
introducing failures into the model.

5. MODELING FAULTS IN DISTRIBUTED SYSTEMS

We next address the problem of modeling faults. There
are a number of faults that can occur. Within a host,
individual storage devices can fail. In addition, a
host can fail due to CPU or main memory failure. In the
network, if a link fails then some communication
capacity will be lost. If a gateway fails, then all the
links that are connected to it will fail. Finally, a
cluster's local area network can fail, affecting all the
communications that use the local area network to
achieve communication between two clusters.

How do we introduce faults into the system? Faults can
be modeled as a special transaction category. For each
resource that can fail there will be an instance of this
transaction category that generates and recovers
failures according to a specified Time Between Failure
(TBF) distribution and a Time To Repair (TTR)
distribution. Obviously there will be different actions that
need to be performed for each device failure. When a
resource failure occurs all the transactions that were
using the resource must be informed of the failure so
that they can simulate the recovery actions in the
system if necessary.
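
The behaviour of a fault transaction, alternating between a time-between-failure wait and a time-to-repair wait, can be sketched in Python as below. The function names are hypothetical, and exponential TBF and TTR distributions are assumed purely for illustration; the paper only requires that both distributions be specified.

```python
import random


def failure_schedule(mtbf, mttr, horizon, rng):
    """Return a list of (fail_time, repair_time) pairs up to `horizon`.

    TBF and TTR are drawn from exponential distributions here, which is
    an assumption; the text only says the distributions are specified.
    """
    schedule = []
    now = 0.0
    while True:
        fail = now + rng.expovariate(1.0 / mtbf)
        if fail >= horizon:
            return schedule
        repair = fail + rng.expovariate(1.0 / mttr)
        schedule.append((fail, min(repair, horizon)))
        now = repair


def steady_state_availability(mtbf, mttr):
    """Long-run fraction of time the resource is up."""
    return mtbf / (mtbf + mttr)
```

Each pair in the schedule marks one outage; transactions using the resource during such an interval are the ones that must be informed of the failure.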

A fault transaction can be used to simulate a
combination of related failures. For example, if the shared
memory of a multiprocessor system fails then all the
processors of the multiprocessor system will also fail.
We now illustrate how we have modeled failures in the
individual hosts.

5.1 COMPUTATION FAILURES

Recall from section 4.4 that a host consists of a set of
resources and a set of related processes (modeled as
PAWS transactions) executing on it. In Figure 4.9 we
depicted this as the complex node CSM. In this section
we expand that node into Figure 5.1.

We have assumed that our system is homogeneous in that
each host has the same configuration. We have done this
in order to keep the simulation as simple as possible.
Thus each host contains three discs represented as
complex nodes.


Figure 5.1. The Host Configuration

The transactions enter at node SETCOMP, which decides the
next phase of the computation. The phase of a
computation is used to determine the requested CPU and
disc service times and whether the computation has
terminated. In Figure 5.2 we develop the complex node
DISC1; the other secondary storage devices are similar.

Figure 5.2 consists of two interacting IPGs. One of
these models a fault transaction that causes the
device to fail and recover. The other models the
handling of computation transactions during both normal
and failure modes.

The fault transaction waits for some TBF and then sets a
global flag at node SETF, ensuring that all computation
transactions that arrive at the device during
failure will bypass it. Next the failure transaction
uses the SETPWR node to set the power of node DEV1 to
be a very large number, ensuring that all the computation
transactions waiting or receiving service at DEV1
will leave DEV1 immediately. Finally, the fault
transaction will set DEV1's power to zero so as to block all
subsequent requests. After a delay of some TTR the
fault transaction will set the power of DEV1 back to 1
(normal operation) and reset the global flag.

Computation transactions go to node CHG1 to have their
phase modified in case of failure. During normal
operation the failure flag has been reset so the transaction
will visit DEV1. During the failure of DEV1, the
transactions will bypass the DEV1 node and leave the
complex node DISC1 with a transaction phase that implies
failure. Similarly CHG2 handles the flushed
transactions that reach it immediately after a failure occurs.
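
The device failure sequence just described (set the flag, flush the device by raising its power, block it by setting the power to zero, then restore it after the repair delay) can be sketched as a small state machine. The Python class below is hypothetical and simplifies the model: flushing is done by emptying a list rather than by actually speeding up service, and the failure phase is represented by the string "FAILED".

```python
class Device:
    """Hypothetical model of a device service node and its failure flag."""

    def __init__(self):
        self.power = 1.0        # service-rate multiplier; 1 = normal
        self.failed = False     # the global failure flag
        self.in_service = []    # transactions waiting or being served

    def arrive(self, transaction):
        """Return the phase a transaction leaves with, or None if queued."""
        if self.failed:
            return "FAILED"     # bypass the device during failure
        self.in_service.append(transaction)
        return None

    def fail(self):
        """Set the flag, flush everything in service, then block the device."""
        self.failed = True
        self.power = float("inf")           # stands in for "a very large number"
        flushed, self.in_service = self.in_service, []
        self.power = 0.0                    # block subsequent requests
        return flushed

    def repair(self):
        """After the repair delay: restore normal power and reset the flag."""
        self.power = 1.0
        self.failed = False
```

The transactions returned by fail() correspond to the flushed transactions whose phase is changed immediately after the failure, while those turned away by arrive() correspond to the ones that bypass the device while the flag is set.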


We finally address the notion of computation failures at
the host level. In Figure 4.9 we described how each
host possessed a single transaction. Further, this
transaction instantiated SIBLING computation
transactions to execute computations on behalf of the host
users. This approach pays off when modeling failures
because the host's transaction can, when the host fails,
be made to interrupt all of its siblings. These
siblings may be anywhere in the CSM network at that
instant of time. Their interruption implies that all
computation at that host automatically ceases.

We illustrate this in Figure 5.3. Here the host user
transaction IPG is not shown as in Figure 4.9. In
Figure 5.3 the failure transaction for a host first
waits for some interval before failure occurs. At
failure the failure transaction enters the mutual
exclusion region bounded by MEGET and MEREL. On
entering this region it has mutually exclusive access to
the host transaction of the host that must fail. It
then sets its identity in global variables at node
SETPARM, interrupts the host transaction at
HSTBLK and then waits at node WTINT.


Figure 5.3. Host Failure Model

The host transaction is interrupted and gets the
identity of its interrupter at GETID, and then at INTSIB
interrupts all its siblings in CSM. Notice that this
time it does not visit GENCOMP since it is not
instantiating a new computation. After interrupting all its
siblings, the host transaction sets the failure flag and
then interrupts the failure transaction at node WTINT.
The failure transaction exits the mutual exclusion
region and then enters the TTR delay node. The recovery
from failure simply requires that the failure flag be
reset. If the host transaction is interrupted by a host
user transaction while the failure flag is set, the host
transaction interrupts the host user transaction and
indicates that the computation has failed. This has
been shown in Figure 5.3, which does not illustrate
normal operation of the host transaction.

5.2 COMMUNICATIONS FAILURES

Failure of communications affects the static paths
between clusters in the system. There is a hierarchy of
failures in communications. A link failure implies that
some set of paths between clusters fail. We call these
paths the link's dependent set. Failure of a gateway
host implies that the link dependent sets of all the
links connected to that gateway fail. Finally, a cluster
failure implies that all the link dependent sets of all
the links connected to all the gateways of that cluster
have failed.

We define four data structures to hold the cluster
information, the link information, the path information
and the current connectivity information. All paths
between any two clusters are ordered by increasing
length. The current shortest path between two clusters
is pointed to by the connectivity matrix.
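
A minimal Python sketch of these data structures is given below. It is not the FORTRAN implementation mentioned in this paper: the class and method names are hypothetical, path length is taken to be the number of links in the path, and the current shortest path is computed on demand rather than stored as a pointer in a connectivity matrix.

```python
class Connectivity:
    """Static paths per cluster pair, ordered by increasing length."""

    def __init__(self, paths):
        # paths: {(src, dst): [path, ...]}, each path a tuple of link names
        self.paths = {pair: sorted(p, key=len) for pair, p in paths.items()}
        self.failed_links = set()

    def set_link(self, link, up):
        """Record a link failure (up=False) or repair (up=True)."""
        if up:
            self.failed_links.discard(link)
        else:
            self.failed_links.add(link)

    def dependent_set(self, link):
        """All (pair, path) entries that traverse the given link."""
        return [(pair, path) for pair, plist in self.paths.items()
                for path in plist if link in path]

    def current_path(self, src, dst):
        """Shortest path with no failed link, or None if disconnected."""
        for path in self.paths.get((src, dst), []):
            if not self.failed_links.intersection(path):
                return path
        return None
```

A gateway or cluster failure can then be expressed as a loop that calls set_link(link, False) for every link attached to the failed gateway or to the gateways of the failed cluster.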

Faults in the communication system affect the long
distance links or the local area networks. These cause
packet transactions to be flushed out of these resources
as in the models for device failure. Thus, the expanded
IPGs for the LDLs and LANs will include nodes that force
packet transactions to bypass the resource in case of
failure. Figure 4.5 is extended to record transmission
failure and the packet is re-transmitted after some
delay by another route.

The PAWS USER node interface to FORTRAN was used to
model communication failure in order to manipulate the
data structures easily. The details of this model have
currently been completed at HA and will appear in a
future report.

6. CONCLUSIONS

We have presented a practical modeling methodology for
distributed systems encompassing models of fault
propagation and fault recovery strategies. PAWS programs
using these models have been implemented and executed.
The major advantages of our methodology are 1) it is
practical and usable by practitioners today, and 2) it
is hierarchical, thus permitting easy modification of a
complex model and permitting the modeling of a complex
system to proceed in conjunction with the design of that
system.

The RPC mechanism is, as we have mentioned earlier, a
distributed programming primitive. This primitive is
used by atomic database transactions in the ZEUS system
design. The models for atomic actions will thus use the
models we have developed in this paper.

We have also presented in this paper some important
modeling techniques using PAWS. One of them is a clear
intuitive model of a CSMA/CD network. Another is an
effective means of integrating fault modeling with
performance modeling.


MISSION

of

Rome Air Development Center


RADC plans and executes research, development, test and selected
acquisition programs in support of Command, Control, Communications
and Intelligence (C3I) activities. Technical and engineering support within
areas of competence is provided to ESD Program Offices (POs) and other
ESD elements to perform effective acquisition of C3I systems. The areas
of technical competence include communications, command and control,
battle management, information processing, surveillance sensors,
intelligence data collection and handling, solid state sciences,
electromagnetics, and propagation, and electronic, maintainability, and
compatibility.


